Method for de novo detection of sequences in nucleic acids: target sequencing by fragmentation

ABSTRACT

The present invention provides a method for determining nucleic acid sequences of a template nucleic acid that requires no prior knowledge of the nucleic acid sequence present in the template nucleic acid. The method is based on combining information about the mass of a fragment, the mass of any one nucleotide and the combinations thereof, and the sequence specificity of a nucleotide cutter, either enzymatic or chemical cutter, to determine a sequence of a nucleic acid fragment. This method allows for de novo detection of sequences in a target nucleic acid without requiring any prior sequence information. This method is called Partial Sequencing by Fragmentation (PSBF) and it works by fragmenting a target into oligo- or polynucleotides whose masses or lengths are uniquely associated with known sequences. The identities of these sequences are determined solely by the specific fragmentation method used, and are always independent of the target. PSBF can be implemented using electrophoresis, mass spectrometry or any other method that can be used to distinguish the size of the cut nucleic acid sequence fragments.

RELATED APPLICATIONS

The present application is a Continuation of U.S. Utility application Ser. No. 11/547,765, which is a 371 National Stage of International Application No. PCT/US2005/011812 filed on Apr. 8, 2005, which designated the U.S., and which claims benefit under 35 U.S.C. § 119(e) of the U.S. provisional application Ser. No. 60/563,283, filed Apr. 9, 2004 and Ser. No. 60/565,284, filed Apr. 26, 2004, the contents of which are incorporated herein by reference in their entirety.

FIELD OF INVENTION

The present invention is directed to a method for determining the nucleic acid sequences of a target nucleic acid based on the size of particular fragments.

BACKGROUND OF THE INVENTION

There are several applications where it is desirable to quickly and accurately detect the presence of one or more known sequences in a target nucleic acid. Typically this is done using hybridization arrays, PCR, or short-range Sanger sequencing. All of these methods, however, require that one specify which sequences are to be detected (hybridization array), or know a priori the primer sequences in the target (PCR, Sanger).

Sanger sequencing reactions and related methods are usually analyzed by electrophoresis or mass spectrometry. Matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) has two primary advantages over electrophoretic methods for sequencing nucleic acids: high speed and high resolution (Nordhoff et al. 2000; Koster et al. 1996). The main disadvantage of MALDI-TOF mass spectrometry in this regard is its highly limited read length (15 to 40 bases), as compared to electrophoresis, which routinely gives read lengths of several hundred bases.

Recently developed mass spectrometric methods for diagnostic resequencing of DNA utilize a controlled fragmentation of the target DNA sequence, usually several hundred bases in length, into many smaller non-overlapping oligonucleotides of less than fifteen bases (Elso et al. 2002; Rodi et al. 2002). The mass spectrum of these fragments can be thought of as a fingerprint. These mass spectra, when compared against calculated spectra from a known reference sequence, can provide useful sequence information about the target. These methods accomplish fragmentation by using chemical (von Wintzingerode et al. 2002) or enzymatic means (Hartmer et al. 2003), and are specific to a mononucleotide (e.g. cleavage after every dA residue).

Mononucleotide-specific fragmentation methods are inefficient, and typically destroy much of the sequence information in the target DNA in the process of generating oligonucleotides that are short enough for analysis by mass spectrometry (Zabeau et al. 2000). This is because about 40%-50% of the target DNA is reduced to fragments four nucleotides or shorter in a typical cleavage reaction, which are too small to be informative using a MALDI-TOF instrument.

Cleavage techniques that are specific to dinucleotide sequences have been developed to overcome the limitations of mononucleotide-specific fragmentation (Stanton, Jr. et al. 2003). A specific dinucleotide cleavage reaction would be expected to produce fragments with an average length of sixteen bases, which is ideal for analysis by MALDI-TOF MS. These methods utilize chemically modified nucleotide analogs (Wolfe et al. 2003) or template-directed incorporation of dinucleotide triphosphates by special polymerases (Kless 2001).

However, all these cleavage methods share a fundamental limitation: there is no way to determine the order of the bases in a fragment given only its length or molecular mass. This effectively means that existing fragmentation methods are limited to applications where a reference sequence is available so that possible fragment masses can be calculated beforehand (Bocker 2003).

It would be useful to develop methods for determining nucleic acid sequences that would require no prior sequence knowledge.

SUMMARY OF THE INVENTION

We have discovered a method for determining nucleic acid sequences of a template nucleic acid that requires no prior knowledge of the nucleic acid sequence present in the template nucleic acid. The method is based on combining information about the mass of a fragment, the mass of any one nucleotide and the combinations thereof, and the sequence specificity of a nucleotide cutter, either enzymatic or chemical cutter, to determine a sequence of a nucleic acid fragment.

This method allows for de novo detection of sequences in a target nucleic acid without requiring any prior sequence information. This method is called Partial Sequencing by Fragmentation (PSBF) and it works by fragmenting a target into oligo- or polynucleotides whose masses or lengths are uniquely associated with known sequences. The identities of these sequences are determined solely by the specific fragmentation method used, and are always independent of the target. PSBF can be implemented using electrophoresis, mass spectrometry or any other method that can be used to distinguish the size of the cut nucleic acid sequence fragments.

The method of the present invention is useful in all applications where the analysis requires determination of sequence information of a template nucleic acid. Such applications include mutation detection, screening of biological samples such as tumor samples for nucleic acid variations, pathogen and/or pathogen strain identification in any biological sample material, determining sequence differences between different species, breeds, or strains and so forth.

A particularly useful application of the present method is the sequencing of nucleotide repeats in any target template. Such repeats include mono-nucleotide repeats or di- or tri-nucleotide repeats, which are usually difficult to resolve using traditional Sanger sequencing or sequencing using nucleotide arrays. Therefore, the method of the present invention is particularly useful in combination with other sequencing methods to resolve nucleic acid regions of low compositional complexity.

The method of the present invention also allows scanning of large nucleic acid regions, including partial or even whole chromosomes, for particular sequences. When sequencing large nucleic acid fragments, use of frequent cutters is preferred to limit the number of fragments that need to be analyzed. For example, single nucleotide cutters can be used to digest all other sequences in a template nucleic acid, which includes a chromosome, so that only the nucleic acid fragments containing dATPs remain in the sample. The mass analysis of the fragments, combined with the knowledge of the mass of dATP and the fact that the sequences only contain stretches of nucleotide A, will allow scanning of A-rich segments. Further, if the fragment mass analysis is performed using mass spectrometric tools, the number of fragments with the same number of repeats can be assessed from the surface area of the peak. This kind of scan has applications, for example, in determining the approximate number of genes in a particular chromosome or chromosomal region based on the presence of poly-A tails.

In one embodiment, the invention provides a method of sequencing comprising the steps of obtaining a nucleic acid template, which can be either a single-stranded or a double-stranded template. Next, a transcript of the target template is produced using appropriate polymerase(s) and nucleotides selected for sequence-specific reactivity and molecular weight. Primers for the transcription can be random nucleotide primers or sequence-specific primers. For the method wherein no prior knowledge of the sequence is required, the primers are preferably random primers. The transcript is cleaved in a sequence-specific manner using either enzymatic or chemical cleavage methods. Cleavage should be complete and produce only non-overlapping fragments in one reaction. Cleavage reactions with complex specificities may require multiple reactions. Such multiple reactions may be performed either simultaneously or serially. In the next step, the cleavage reaction products are analyzed either by length or by mass, preferably by mass. However, length analysis can be used, particularly when the resulting fragments are known to consist only of single nucleotide repeats. In the next step, using the combination of the mass/length of the fragment and the cleavage specificity of the nucleic acid cutter, one can calculate the molecular weights and sequences of all possible fragments that can result from the cleavage (the fragment identity mapping). The mapping is dependent only on the cleavage reactions and nucleotides used and is totally independent of the sequence of the target. In the final step, the masses are compared with the fragment identity mapping to determine at least one subsequence that is present in the target nucleic acid sequence.

In another embodiment, the invention provides a method of obtaining overlapping fragments to enable complete sequencing of a target nucleic acid. In this embodiment, several parallel transcription, digestion, and fragment mass analyses are performed to produce at least 2, 5, 10, 15, 20, 50, 100 up to at least 1000 different sets of fragments, preferably covering all or most of the target sequence, and the sequence of the target is compiled based on overlapping fragments after determination of the sequence of the subsequences as described above. In this method, multicutters that cut less frequently are preferred to obtain relatively longer subfragments to allow identification of overlapping fragments.

In one embodiment, the invention provides a method to scan large templates such as a complete or partial chromosome to identify regions of interest. Such regions of interest include but are not limited to, for example, poly-A regions to estimate the number of genes in the chromosome or part of the chromosome by using identification of poly-A tails. In a method directed to detection of single nucleotide repeats, a single nucleotide cutter is preferably used.

In another embodiment, the invention provides a method to scan large nucleic acid templates for specific, low complexity nucleotide repeats, such as single, di-, or tri-nucleotide repeats. In such an embodiment, the nucleotide cutters are specific to the corresponding di-, tri-, etc. nucleotide repeat.

In one embodiment, the invention provides a method of determining the number of nucleotide repeats in a sequence. The number of fragments with identical sequence can be determined using the surface area of the mass spectrometric peak.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a general overview of steps involved in the target sequencing by fragmentation method of the present invention. Step 1 involves obtaining target nucleic acid for partial target sequencing. The nucleic acid may be single or double stranded and no prior sequence information about the target is necessary. In step 2, one creates a transcript of the target using appropriate polymerase(s) and nucleotides selected for sequence-specific reactivity and molecular weight. Primers for the transcription reaction may be specific or random. In step 3, the transcript is cleaved in a sequence-specific manner using enzymatic or chemical means or a combination of both; photocleavage can also be used. Cleavage should be complete and produce only non-overlapping oligonucleotide fragments. Cleavages with complex specificities may require multiple reactions, which may be performed either simultaneously or serially. In step 4, one analyzes the cleavage reaction products, for example, by mass spectrometry to determine the molecular weights of fragments. Peak quantification information can also be obtained but is not required. Such quantification can show how many of any particular sequences are present in the target nucleic acid sequence. In step 5, one calculates the molecular weights and sequences of all possible fragments that can result from Step 3 using nucleotide masses and cleavage specificities (i.e. performs the Fragment Identity Mapping). The mapping is dependent only on the cleavage reactions and nucleotides used and is totally independent of the sequence of the target. The masses observed in Step 4 are then compared with the Fragment Identity Mapping to determine what subsequences are present in the target.

FIG. 2 shows an example using the present method, steps 1-5, of multicutter _(16/15) [inv(A.A)] using modified nucleotides and cleavage reactions described by Stanton Jr. et al (2003, U.S. Pat. No. 6,610,492). Positions of incorporated modified nucleotides are indicated by an asterisk (*) and positions where cleavage occurs are indicated by an inverted triangle (▾). Step 1 shows obtaining target nucleic acid for partial target sequencing using a multicutter _(16/15) [inv(A.A)]. In step 2, one performs a PCR amplification of the target using modified nucleotides: dATP, 5-OH-dCTP, 7-deaza-7-nitro-dGTP, 5-OH-dUTP and appropriate polymerases. In step 3, the PCR products are cleaved using KMnO4 and 3-Pyrrolidinol (only forward strand is cleaved). In step 4, one analyzes cleavage products, for example, by mass spectrometry. In step 5, the observed masses are compared with the Fragment Identity Mapping for _(16/15) [inv(A.A)] to identify all the sequence fragments present in the target nucleic acid (SEQ ID NO: 85).

FIG. 3 shows an example implementation of Steps 1-5 of multicutter _(4/3)[B.] using deoxy- and ribonucleotides. Positions of incorporated ribonucleotides are indicated by an asterisk (*) and positions where cleavage occurs are indicated by an inverted triangle (▾). In step 1, target nucleic acid is obtained for partial sequencing using multicutter _(4/3) [B.]. In step 2, one creates a transcript using nucleotides: dATP, rCTP, rGTP, rTTP, and an appropriate polymerase. In step 3, the transcript is cleaved using alkali or nonspecific RNases. In step 4, one analyzes the cleavage products, for example, by mass spectrometry. In step 5, the observed masses are compared with the Fragment Identity Mapping for _(4/3) [B.] (SEQ ID NO: 86).

FIG. 4 shows structures that can be used with nucleotides to overcome a deficiency in the method of U.S. Pat. No. 6,566,059. This method uses rNTPs and 5′-amino-2′,5′-dideoxyribonucleotides (nNTPs) and, as presented, dinucleotides composed of two of the same nucleotide cannot be cleaved.

FIG. 5 shows an example implementation of Steps 1-5 of multicutter _(16/9)[B.H] using modified nucleotides described by Stanton Jr. et al (2003, U.S. Pat. No. 6,566,059) and modified nucleotides described in the text. Positions of nucleotides with 2′-OH groups are indicated by an asterisk (*), positions of nucleotides with 5′-NH groups are indicated by (n) and positions where cleavage occurs are indicated by an inverted triangle (▾). In step 1, one obtains target DNA for sequencing using multicutter _(16/9) [B.H]. In step 2, a transcript is created using nucleotides: nATP, nrCTP, rGTP, nrTTP and an appropriate polymerase. In step 3, one performs polymerase-mediated transcript cleavage. In step 4, the cleavage products are analyzed by mass spectrometry, and in step 5, the observed masses are again compared with the Fragment Identity Mapping for _(16/9) [B.H] (SEQ ID NOS 87 & 3 are disclosed respectively in order of appearance).

FIG. 6 shows structures of dinucleotide triphosphates 5′ppp-dNdN (left) and 5′ppp-rNrN (right) that can be used in accordance with the methods of the present invention.

BRIEF DESCRIPTION OF THE TABLES

Table 1 shows the nucleotide abbreviations.

Table 2 shows permutations for generic nucleotides.

Table 3A shows statistics for a single base cleavage multicutter and Table 3B shows all possible fragments at every L using the single base cutter.

Table 4A shows the statistics of a multicutter variation that preserves homopolymeric regions of the cleaved nucleotide and Table 4B shows the corresponding possible fragments for L=5.

Table 5A shows the statistics of cleavage products using a method for cleaving at a specific dinucleotide composed of two different bases _(16/1)[A.C], and Table 5B shows all possible fragments for L=5 using this method.

Table 6 shows types of Fragment Identity Mappings.

Table 7A shows statistics for the multicutter _(16/15)-[inv(A.A)], and Table 7B shows fragments for L four through eight for the same multicutter.

Table 8A shows the statistics for multicutter _(4/3)[B.] or _(16/12)[B.N.], and Table 8B shows the fragments for L four through eight for the same cutter.

Table 9A shows statistics of multicutter _(16/9)[C.M V.K T.T], and Table 9B shows the fragments for L four through eight for the same cutter.

Table 10A shows statistics of multicutter _(16/14)[inv(A.C C.A)], and Table 10B shows the fragments for L four through eight for the same cutter.

Table 11A shows statistics of multicutter _(16/13)[inv(A.C C.G G.A)], and Table 11B shows the fragments for L four through eight for the same cutter.

Table 12A shows statistics of multicutter _(16/12)[inv(A.C C.G G.T T.A)], and Table 12B shows the fragments for L four through eight for the same cutter.

Table 13A shows statistics of multicutter _(16/11)[inv(A.T K.M)]:24, and Table 13B shows the fragments for L four through eight for the same cutter.

Table 14A shows statistics of multicutter _(16/13)[C.A M.K K.N], and Table 14B shows the fragments for L four through eight for the same cutter.

Table 15A shows statistics of multicutter _(16/9)[B.V], and Table 15B shows the fragments for L four through eight for the same cutter.

Table 16A shows statistics of multicutter _(16/6)[C.A G.M T.V], and Table 16B shows the fragments for L four through six for the same cutter.

Table 17 shows nucleotide structures and molecular weights.

Table 18 shows nucleotides used to implement multicutter family _(16/15)[inv(α.α):4].

Tables 19A and 19B show the strict fragment identity mappings for each multicutter in the family described in Table 18 (10-mers are disclosed as SEQ ID NOS 88-91, respectively in order of appearance).

Table 20 shows nucleotides used to implement multicutter family _(4/3)[inv(α.)]:4.

Table 21 shows Fragment Identity Mapping for multicutters in family _(4/3)[α. β. γ.].

Table 22 shows nucleotides used to implement multicutter family _(16/9)[inv(α.η η.β)].

Table 23 shows Fragment Identity Mapping for multicutter _(16/9)[B.V] (nATP, nrCTP, nrGTP, rTTP).

Tables 24A and 24B show Fragment Identity Mapping for multicutter _(16/9)[B.H] (nATP, nrCTP, nrGTP, rTTP) (SEQ ID NOS 92-120 & 28-84 are disclosed respectively in order of appearance).

Table 25 shows identification of mycobacterial 16S rDNA using multicutter family _(4/3)[inv(α.)]:4.

DETAILED DESCRIPTION OF THE INVENTION

Provided herein are methods for sequencing and detecting nucleic acids using techniques, such as mass spectrometry and gel electrophoresis, that are based upon molecular mass.

We have discovered a method for determining nucleic acid sequences of a template nucleic acid that requires no prior knowledge of the nucleic acid sequence present in the template nucleic acid. The method is based on combining information about the mass of a fragment, the mass of any one nucleotide and the combinations thereof, and the sequence specificity of a nucleotide cutter, either enzymatic or chemical cutter, to determine a sequence of a nucleic acid fragment.

This method allows for de novo detection of sequences in a target nucleic acid without requiring any prior sequence information. This method is called Partial Sequencing by Fragmentation (PSBF) and it works by fragmenting a target into oligo- or polynucleotides whose masses or lengths are uniquely associated with known sequences. The identities of these sequences are determined solely by the specific fragmentation method used, and are always independent of the target. PSBF can be implemented using electrophoresis, mass spectrometry or any other method that can be used to distinguish the size of the cut nucleic acid sequence fragments.

The method of the present invention is useful in all applications where the analysis requires determination of sequence information of a template nucleic acid. Such applications include mutation detection, screening of biological samples such as tumor samples for nucleic acid variations, pathogen and/or pathogen strain identification in any biological sample material, determining sequence differences between different species, breeds, or strains and so forth.

A particularly useful application of the present method is the sequencing of nucleotide repeats in any target template. Such repeats include mono-nucleotide repeats or di- or tri-nucleotide repeats, which are usually difficult to resolve using traditional Sanger sequencing or sequencing using nucleotide arrays. Therefore, the method of the present invention is particularly useful in combination with other sequencing methods to resolve nucleic acid regions of low compositional complexity. The method can complement traditional sequencing, such as Sanger sequencing, which is often unable alone to determine the number of single nucleotide repeats in a target sequence.

The method of the present invention also allows scanning of large nucleic acid regions, including partial or even whole chromosomes, for particular sequences. When sequencing large nucleic acid fragments, use of frequent cutters is preferred to limit the number of fragments that need to be analyzed. For example, single nucleotide cutters can be used to digest all other sequences in a template nucleic acid, which includes a chromosome, so that only the nucleic acid fragments containing dATPs remain in the sample. The mass analysis of the fragments, combined with the knowledge of the mass of dATP and the fact that the sequences only contain stretches of nucleotide A, will then allow scanning of A-rich segments. One can use this method to identify fragments having any type of sequence pattern one is looking for. Further, if the fragment mass analysis is performed using mass spectrometric tools, the number of fragments with the same number of repeats can be assessed from the surface area of the peak. This kind of scan has applications, for example, in determining the approximate number of genes in a particular chromosome or chromosomal region based on the presence of poly-A tails.
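By way of illustration only, the A-run scan described above can be sketched in silico. The sketch assumes an idealized cutter that removes every non-A base and leaves each maximal run of A intact; the function names, the example sequence, and the neglect of end groups in the mass estimate are assumptions made for this illustration. The tally of runs at each length stands in for what the peak area would be used to estimate in a mass spectrum.

```python
import re
from collections import Counter

M_A = 313.2  # average mass of dA in daltons, as quoted in this document

def a_run_histogram(sequence):
    """Tally the lengths of maximal A-runs left after all other bases are digested.

    Idealized model (assumption): cleavage is complete and every run of A survives intact.
    """
    runs = re.findall(r"A+", sequence.upper())
    return Counter(len(run) for run in runs)

def run_mass(n):
    """Approximate mass of an A-only fragment of n residues (end groups ignored)."""
    return round(n * M_A, 1)

# Hypothetical target used only for this example
hist = a_run_histogram("GCAAATTCAAAAAGAAACT")
for length, count in sorted(hist.items()):
    print(length, count, run_mass(length))
```

In this model each distinct run length gives one peak, and the number of runs sharing that length plays the role of the peak area.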

Accordingly, in one embodiment, the invention provides a method of sequencing comprising the steps of obtaining a nucleic acid template, which can be either a single-stranded or a double-stranded template. The nucleic acids can be isolated and/or purified using any known standard nucleic acid isolation and purification techniques. As used herein “nucleic acid” refers to polynucleotides such as deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). The term should also be understood to include, as equivalents, derivatives, variants and analogs of either RNA or DNA made from nucleotide analogs, single (sense or antisense) and double-stranded polynucleotides. Deoxyribonucleotides include deoxyadenosine, deoxycytidine, deoxyguanosine and deoxythymidine. For RNA, the nucleoside containing the uracil base is uridine.

Next, a transcript of the target template is produced by using appropriate polymerase(s) and nucleotides selected for sequence-specific reactivity and molecular weight. Useful polymerases include DNA polymerases, i.e. enzymes that replicate DNA using a DNA template; reverse transcriptases, i.e. enzymes that synthesize DNA using an RNA template; and RNA polymerases, which synthesize RNA from a template DNA, including eukaryotic RNA polymerases I, II and III, each comprising two large subunits and 12-15 smaller subunits. RNA polymerase II is natively involved in the transcription of all protein genes and most snRNA genes, and is thus the preferred RNA polymerase in the methods of the present invention. Alternatively, RNA polymerase I, which is natively located in the nucleolus and transcribes rRNA genes except 5S rRNA, can be used. RNA polymerase III, which is located outside the nucleolus and transcribes 5S rRNA, tRNA, U6 snRNA and some small RNA genes, can also be used in certain applications of the invention. DNA polymerases and reverse transcriptases are preferred. Polymerases such as T3 and T7 can also be used. All the polymerases are available to one skilled in the art from various commercial sources. Selection of polymerase is a routine exercise to a skilled artisan based upon the nature of the template and the nucleotides that are incorporated into the synthesized transcript.

Useful “nucleotides” in the methods of the present invention include, but are not limited to, the naturally occurring nucleoside mono-, di-, and triphosphates: deoxyadenosine mono-, di- and triphosphate; deoxyguanosine mono-, di- and triphosphate; deoxythymidine mono-, di- and triphosphate; and deoxycytidine mono-, di- and triphosphate (referred to herein as dA, dG, dT and dC or A, G, T and C, respectively). Also useful are modified nucleotides such as nATP, nrCTP, rGTP, nrTTP, dinucleotide triphosphates 5′ppp-dNdN and 5′ppp-rNrN, rCTP, rTTP, 5-OH-dCTP, 7-deaza-7-nitro-dGTP, and 5-OH-dUTP. Nucleotides also include, but are not limited to, modified nucleotides and nucleotide analogs such as deazapurine nucleotides, e.g., 7-deaza-deoxyguanosine (7-deaza-dG) and 7-deaza-deoxyadenosine (7-deaza-dA) mono-, di- and triphosphates, deutero-deoxythymidine (deutero-dT) mono-, di- and triphosphates, methylated nucleotides, e.g., 5-methyldeoxycytidine triphosphate, ¹³C/¹⁵N labeled nucleotides and deoxyinosine mono-, di- and triphosphate, and 5′-amino-2′,5′-dideoxy analogs of adenosine, cytidine, guanosine, inosine and uridine. Also useful are 7-deaza-7-nitro-dATP, 7-deaza-7-nitro-dGTP, 5-hydroxy-dCTP, and 5-hydroxy-dUTP, or other modified nucleotides that have increased chemical reactivity but are able to form standard Watson-Crick base pairs (see, e.g. Wolfe et al, PNAS 99:11073-11078). For those skilled in the art, it will be clear that modified nucleotides and nucleotide analogs can be obtained using a variety of combinations of functionality and attachment positions.

Primers for the transcription can be random nucleotide primers or sequence-specific primers. For the method wherein no prior knowledge of the sequence is required, the primers are preferably random primers. As used herein, a “primer” refers to an oligonucleotide that is suitable for hybridizing, chain extension, amplification and sequencing. Similarly, a probe is a primer used for hybridization. The primer refers to a nucleic acid that is of low enough mass, typically between about 5 and 200 nucleotides, generally about 70 nucleotides or less than 70, and of sufficient size to be conveniently used in the methods of amplification and methods of detection and sequencing provided herein. These primers include, but are not limited to, primers for detection and sequencing of nucleic acids, which require a sufficient number of nucleotides to form a stable duplex, typically about 6-30 nucleotides, about 10-25 nucleotides and/or about 12-20 nucleotides. Thus, for purposes herein, a primer is a sequence of nucleotides of any suitable length, typically containing about 6-70 nucleotides, and all integers in between such as 12-70 nucleotides or, for example, 14-22, depending upon sequence and application of the primer.

The transcript is cleaved in a sequence-specific manner using either enzymatic or chemical cleavage methods. In one embodiment, photocleavage methods can be used (Sauer et al., NAR 31:e63, pp. 1-10 2003). Useful enzymatic cutters according to the methods of the invention include, but are not limited to, widely available restriction enzymes and RNase T1, known to one skilled in the art. Useful chemical cutters according to the methods of the invention include, but are not limited to, potassium permanganate (KMnO₄), 3-pyrrolidinol, and osmium tetroxide (OsO₄).

Cleavage should be complete and produce only non-overlapping fragments in one reaction. Cleavage reactions with complex specificities may require multiple reactions. Such multiple reactions may be performed either simultaneously or serially. In the next step, the cleavage reaction products are analyzed either by length or by mass, preferably by mass. However, length analysis can be used, particularly when the resulting fragments are known to consist only of single nucleotide repeats. In the next step, using the combination of the mass/length of the fragment and the cleavage specificity of the nucleic acid cutter, one can calculate the molecular weights and sequences of all possible fragments that can result from the cleavage (the fragment identity mapping). The mapping is dependent only on the cleavage reactions and nucleotides used and is totally independent of the sequence of the target. In the final step, the masses are compared with the fragment identity mapping to determine at least one subsequence that is present in the target nucleic acid sequence.

In another embodiment, the invention provides a method of obtaining overlapping fragments to enable complete sequencing of a target nucleic acid. In this embodiment, several parallel transcription, digestion, and fragment mass analyses are performed to produce at least 2, 5, 10, 15, 20, 50, 100 up to at least 1000 different sets of fragments, preferably covering all or most of the target sequence, and the sequence of the target is compiled based on overlapping fragments after determination of the sequence of the subsequences as described above. In this method, multicutters that cut less frequently are preferred to obtain relatively longer subfragments to allow identification of overlapping fragments.

In another embodiment, the invention provides a method to scan large nucleic acid templates for specific, low complexity nucleotide repeats, such as single, di-, or tri-nucleotide repeats. In such an embodiment, the nucleotide cutters are specific to the corresponding di-, tri-, etc. nucleotide repeat.

In one embodiment, the invention provides a method of determining the number of nucleotide repeats in a sequence. The number of fragments with identical sequence can be determined using the surface area of the mass spectrometric peak.

Partial sequencing by fragmentation (PSBF) according to the present invention is a method that uses a grouped multicutter to cleave a target into non-overlapping fragments, and then provides the complete base sequence (the identity) of every fragment. This is in direct contrast to all other fragmentation methods that can only provide the relative sizes or at best the molecular weights of the fragments produced after cleavage. PSBF is a method for de novo sequencing: no prior information about the target is required.

Every PSBF reaction generates a known fixed fragment pool, which is the total set of possible fragments that result from cleavage. The base sequence and molecular weight of every member of the fixed fragment pool is totally and uniquely determined by the specific multicutter used in the PSBF reaction and is independent of the sequence of the target. The data from a PSBF experiment indicates which members of the fragment pool were produced during the cleavage reaction, and which ones were not. Since the base sequences of all fragments are known, PSBF effectively provides a list of subsequences that are present in the target.
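The notion of a fixed fragment pool can be illustrated with a minimal in silico sketch for the multicutter _(4/3)[B.], read here as cleavage immediately 3′ of every C, G or T so that each internal fragment is a run of zero or more A followed by exactly one non-A base. The function names and the example target are hypothetical, and treating a trailing A-only run as a fragment is an assumption of the sketch, not something stated in this document.

```python
import re

def digest_after_b(target):
    """Cleave 3' of every non-A base (B = C, G or T) into non-overlapping fragments.

    A trailing run of A with no terminating B is returned unchanged (assumption).
    """
    return re.findall(r"A*[CGT]|A+$", target.upper())

def fixed_pool(max_len):
    """Every B-terminated fragment this cutter can produce up to max_len.

    The pool depends only on the cleavage rule, never on the target sequence.
    """
    return {"A" * (n - 1) + b for n in range(1, max_len + 1) for b in "CGT"}

# Hypothetical target used only for this example
fragments = digest_after_b("AACGATAAAG")
pool = fixed_pool(4)
print(fragments)
print(all(f in pool for f in fragments))
```

Whatever target is supplied, the B-terminated fragments always fall inside the same pool, which is the property the text relies on; only the subset of pool members actually observed changes from target to target.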

Fragment Identity Mapping (FIM) is a method of establishing a one-to-onecorrespondence between fragments with known base sequences and specificmasses. Under ordinary circumstances, it is not possible to determinethe sequence of bases in a fragment from its molecular weight alone(Bocker 2003). Under the conditions of a PSBF reaction, the molecularweight of a fragment can be used to determine its base sequence, as wellas the identity of the bases surrounding that fragment in the intacttarget.

A fragment identity mapping is established by employing a combination ofa grouped multicutter and a set of nucleotides with appropriate masses.In general, the following conditions are used:

(I) Every possible fragment that can be produced by cleavage with the specified grouped multicutter should possess a unique base composition. For nucleic acids composed of four different nucleotides, this means that there can be no more than (L+1)(L+2)(L+3)/6 possible fragments at any given length L.
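The bound in condition (I) is the number of distinct base compositions of length L over four nucleotides. The following is a minimal sketch that checks the closed form against a brute-force count; the function names are illustrative and are not part of the specification.

```python
from itertools import combinations_with_replacement
from math import comb

def composition_count(length, alphabet="ACGT"):
    """Count distinct base compositions (multisets of bases) of a given length."""
    return sum(1 for _ in combinations_with_replacement(alphabet, length))

def closed_form(length):
    """(L+1)(L+2)(L+3)/6, which equals the binomial coefficient C(L+3, 3)."""
    return (length + 1) * (length + 2) * (length + 3) // 6

for L in range(1, 9):
    # Brute force, closed form, and binomial coefficient must all agree.
    assert composition_count(L) == closed_form(L) == comb(L + 3, 3)
    # Compare the composition bound with the 4^L possible sequences.
    print(L, closed_form(L), 4 ** L)
```

The printed comparison shows how quickly the number of sequences outgrows the number of compositions, which is why an unrestricted fragment pool cannot satisfy condition (I).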

(II) Every possible base composition should have a unique molecular weight using the specified set of nucleotides. See, e.g., Cantor & Siddiqi (2003, U.S. Pat. No. 6,660,229) for a detailed discussion of a method for selecting nucleotide masses that meet this criterion. Briefly, since each of the four naturally occurring nucleotide bases dC, dT, dA and dG, also referred to herein as C, T, A and G, in DNA has a different molecular weight, M_(C)=289.2, M_(T)=304.2, M_(A)=313.2 and M_(G)=329.2, where M_(C), M_(T), M_(A), M_(G) are average molecular weights in daltons of the nucleotide bases deoxycytidine, thymidine, deoxyadenosine, and deoxyguanosine, respectively, it is possible to read an entire sequence in a single mass spectrum.

Stanton Jr. et al. (2003, U.S. Pat. No. 6,610,492) describe an alternative method for assigning unique masses to oligonucleotides of different base compositions.

All fragment identity mappings are determined entirely by the choice of multicutter and nucleotides and are totally independent of the sequence of the target. There are three types of fragment identity mappings: strict, relaxed, and limited. For strict mappings, condition (I) is true for all fragments of all lengths and condition (II) is true for all masses to infinity. For relaxed mappings, condition (I) is true only at certain pre-determined fragment lengths while condition (II) is true for all masses to infinity. For limited mappings, condition (I) is true for all fragments of all lengths but condition (II) holds only across a certain pre-determined mass range. This is summarized in Table 6.

In general, strict mappings detect homopolymeric subsequences in the target, relaxed mappings detect tandem repeats in the target, and limited mappings detect monotonic subsequences.

In order to determine if a given multicutter meets condition (I) and can be used to establish a fragment identity mapping, the following algorithm is used:

At every fragment length L>1

Step 1: Form the set S_(L) of all 4^(L) possible fragments of length L;

Step 2: Eliminate all fragments in S_(L) that are cleaved at least once by the multicutter in question;

Step 3: Eliminate all fragments in S_(L) that do not have conforming 5′ and 3′ termini;

Step 4: Determine the number of different base compositions represented by the fragments remaining in S_(L);

Step 5: If the number of fragments in S_(L) is equal to the number of base compositions calculated in Step 4, then the multicutter meets condition (I) and can potentially be used to establish a fragment identity mapping for fragments of length L.
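Steps 1 through 5 can be sketched for dinucleotide multicutters as follows. This is an illustration under stated assumptions, not the implementation used to generate the tables: a multicutter is represented here as a set of two-letter strings "XY", each meaning "cut between X and Y", and a terminus is treated as conforming if some cleaved dinucleotide could have produced it. The names `fragments` and `meets_condition_one` are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def fragments(cutter, length):
    """Steps 1-3: uncleaved length-L sequences with conforming termini.

    `cutter` is a set of dinucleotides "XY", each meaning "cut between X and Y"
    (an assumed representation, not notation from the specification).
    """
    five_prime_ok = {d[1] for d in cutter}    # bases that can follow a cut
    three_prime_ok = {d[0] for d in cutter}   # bases that can precede a cut
    kept = []
    for seq in map("".join, product(BASES, repeat=length)):      # Step 1
        if any(seq[i:i + 2] in cutter for i in range(length - 1)):
            continue                                              # Step 2
        if seq[0] in five_prime_ok and seq[-1] in three_prime_ok:
            kept.append(seq)                                      # Step 3
    return kept

def meets_condition_one(cutter, length):
    frags = fragments(cutter, length)
    compositions = {"".join(sorted(f)) for f in frags}            # Step 4
    return len(frags) == len(compositions)                        # Step 5

ALL_DINUCS = {a + b for a in BASES for b in BASES}
inv_aa = ALL_DINUCS - {"AA"}            # cuts everywhere except inside A.A
single_a = {"A" + n for n in BASES}     # cuts 3' to every A, i.e. [A.N]

print(fragments(inv_aa, 5), meets_condition_one(inv_aa, 5))
print(len(fragments(single_a, 3)), meets_condition_one(single_a, 3))
```

Because Step 1 enumerates all 4^(L) sequences, this brute-force form is only practical for short fragment lengths.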

In general, for a multicutter to be experimentally useful, it must meet condition (I) for one or more fragment lengths L>3.

If the multicutter in question meets condition (I), then the following algorithm is used to determine if it meets condition (II) and forms a fragment identity mapping using a specified set of nucleotides.

Step 1: Form the set S_(total) which is the union of all sets S_(L) calculated previously;

Step 2: Calculate the molecular weight of each fragment in S_(total) using the masses of specified nucleotides;

Step 3: Determine which fragments in S_(total) have unique molecular weights. For the purposes of this discussion, a fragment has a unique molecular weight if no other fragments in S_(total) have masses that are closer than one Dalton;

Step 4: If there is at least one fragment for L>3 that has a unique molecular weight, then the given combination of multicutter and nucleotides forms a fragment identity mapping.
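The mass-uniqueness test of Steps 2 through 4 can be sketched as below, assuming the fragment pool S_(total) has already been built and truncated to a finite maximum length. Masses are the average values quoted in this specification, stored in tenths of a dalton so that comparisons are exact; M_(term) is left as a placeholder of zero, and all function names are illustrative.

```python
# Average nucleotide masses from the text, in tenths of a dalton (integers).
MASS = {"C": 2892, "T": 3042, "A": 3132, "G": 3292}

def fragment_mass(seq, m_term=0):
    """Step 2: mass of a fragment; m_term is a placeholder for terminal groups."""
    return m_term + sum(MASS[b] for b in seq)

def unique_mass_fragments(pool, tolerance=10, m_term=0):
    """Step 3: fragments with no other fragment closer than `tolerance`
    (10 tenths = one dalton)."""
    masses = sorted((fragment_mass(s, m_term), s) for s in pool)
    unique = []
    for i, (m, s) in enumerate(masses):
        below_ok = i == 0 or m - masses[i - 1][0] >= tolerance
        above_ok = i == len(masses) - 1 or masses[i + 1][0] - m >= tolerance
        if below_ok and above_ok:
            unique.append(s)
    return unique

def forms_mapping(pool, tolerance=10, m_term=0):
    """Step 4: at least one fragment longer than three bases has a unique mass."""
    return any(len(s) > 3 for s in unique_mass_fragments(pool, tolerance, m_term))

# Example pool: fragments of the form (A)_(L-1)B for L = 2..8.
tagged_pool = ["A" * (L - 1) + b for L in range(2, 9) for b in "CGT"]
print(forms_mapping(tagged_pool))
print(unique_mass_fragments(["ACAC", "CACA", "ACA", "CAC"]))
```

In the second example the two even-length fragments share a base composition and therefore a mass, so only the odd-length fragments are reported as unique.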

Currently, there are few experimental methods available for specific cleavage of nucleic acids at short sequences such as tri- or dinucleotides (Wolfe et al. 2003). However, the methods of the present invention can be applied to any existing or new sequence-specific cutters. Here, the symbols M_(A), M_(C), M_(G), and M_(T) represent the molecular weights of the nucleotides A, C, G, and T, respectively. The symbols M_(α), M_(β), M_(γ) and M_(δ) represent the molecular weights of the nucleotides α, β, γ, and δ, respectively. The symbol M_(frag) represents the total molecular weight of an oligonucleotide fragment, while the symbol M_(term) represents the combined molecular weights of any chemical groups at the 3′ and 5′ terminal ends of a fragment, such as —OH and phosphate groups. All subscripted variables (i, k, v, w, x, z) used to represent numbers of specific bases in a fragment can only assume positive integer values.

Strict Fragment Identity Mappings

Simple Homopolymeric Subsequences

The simplest PSBF reaction utilizes the multicutter _(16/15)[inv(A.A)] (also written as _(16/15)[A.B B.N]) which is part of the _(16/15)[inv(α.α)]:4 family. This multicutter effectively extracts only the homopolymeric regions of the target and produces an average fragment length of 1.067 bases. Statistics are shown in Table 7a.

This cleavage destroys ˜99% of the target, and can produce exactly one fragment at each length L of the form 5′-(A)_(L)-3′ for L>1. This multicutter is expected to produce only 2.94 detectable fragments per kilobase of target with an interfragment interval of 336 bases.

Fragments for L four through eight are shown in Table 7b.

It is clear by inspection that, no matter what the mass of the nucleotide A (or α), each possible fragment A₂, A₃, A₄, . . . A_(L) will have a unique molecular weight given by M_(term)+L(M_(A)). Note that any given fragment A_(L) actually represents a sequence 5′-B(A)_(L)B-3′ found somewhere in the target. Thus, the fragment AAA is not part of or equivalent to the fragment AAAA because these fragments originate from the sequences BAAAB and BAAAAB in the target.

Mass spectrometry has been adapted and used for sequencing and detection of nucleic acid molecules (see, e.g., U.S. Pat. Nos. 6,194,144; 6,225,450; 5,691,141; 5,547,835; 6,238,871; 5,605,798; 6,043,031; 6,197,498; 6,235,478; 6,221,601; 6,221,605). In particular, Matrix-Assisted Laser Desorption/Ionization (MALDI) and ElectroSpray Ionization (ESI), which allow intact ionization, detection and exact mass determination of large molecules, i.e. well exceeding 300 kDa in mass, have been used for sequencing of nucleic acid molecules.

A further refinement in mass spectrometric analysis of high molecular weight molecules was the development of time of flight mass spectrometry (TOF-MS) with matrix-assisted laser desorption ionization (MALDI). This process involves placing the sample into a matrix that contains molecules that assist in the desorption process by absorbing energy at the frequency used to desorb the sample. Time of flight analysis uses the travel time or flight time of the various ionic species as an accurate indicator of molecular mass. As used herein, reference to mass spectrometry encompasses any suitable mass spectrometric format known to those of skill in the art. Such formats include, but are not limited to, Matrix-Assisted Laser Desorption/Ionization, Time-of-Flight (MALDI-TOF), Electrospray (ES), IR-MALDI (see, e.g., published International PCT application No. 99/57318 and U.S. Pat. No. 5,118,937), Ion Cyclotron Resonance (ICR), Fourier Transform and combinations thereof. MALDI formats, particularly UV and IR, are among the preferred formats. Further details of the use of MALDI-TOF Mass Spectrometry are discussed in Jurinke et al., Molecular Biotechnology, Vol. 26, pp. 147-163, 2004.

As used herein, mass spectrum refers to the presentation of data obtained from analyzing a biopolymer or fragment thereof by mass spectrometry, either graphically or encoded numerically.

As used herein, pattern, with reference to a mass spectrum or mass spectrometric analyses, refers to a characteristic distribution and number of signals (such as peaks or digital representations thereof).

As used herein, signal, in the context of a mass spectrum and analysis thereof, refers to the output data, which is the number or relative number of molecules having a particular mass. Signals include “peaks” and digital representations thereof.

As used herein, a “biological sample” refers to a sample of material obtained from or derived from biological material, such as, but not limited to, body fluids, such as blood, urine, cerebral spinal fluid and synovial fluid, tissues and organs, plants, food products, organic material contained in the soil and so forth. Derived from means that the sample can be processed, such as by purification or isolation and/or amplification of nucleic acid molecules.

Nomenclature and General Framework

The examples show PSBF using fragmentation methods that are specific to mono- or dinucleotides; however, PSBF reactions can also be specific to longer subsequences in the target. In general, one skilled in the art can perform the method with such cleavages using the described principle in light of the non-limiting examples provided in the specification.

Mononucleotide Cleavages

The simplest possible cleavage is one that cuts at a single base, such as cutting 5′ to every A in the target. We denote this cleavage [.A], with the period indicating that cleavage occurs 5′ to the specified base. In this notation, [A.] would signify cleavage 3′ to every A in the target. Reactions that remove or destroy bases entirely (e.g. uracil DNA glycosylase) are represented as [.U.] and are considered to be equivalent to cleaving both 3′ and 5′ to the specified nucleotide.

Combined cleavages, such as simultaneously cutting 3′ to every A and every G, would be represented as _(4/2)[A. G.] or _(4/2)[R.] using the standard code for nucleotide degeneracies, shown in Table 1.

In general, we will refer to combined cleavages as Grouped Multicutters (GMCs or simply “multicutters”). The numerator of the subscripted fraction in the notation indicates the total number of possible mononucleotides, and the denominator is the group complexity, the number of individual cleavages comprising the multicutter. This fraction also gives the average fragment length for cleavage of random sequence, which is 4/2=2.00 bases for [R.]. In this notation, the cleavage [A.] above is considered a grouped multicutter of complexity one and would be written as _(4/1)[A.] even though it is not a combined cleavage.

It is sometimes easier to represent a multicutter in terms of the nucleotides that are not cleaved as opposed to the ones that are cleaved. The notation _(4/3)[inv(.T)] indicates that cleavage occurs 5′ to every nucleotide except T. This is equivalent to _(4/3)[.A .C .G] or _(4/3)[.V]. Note that the denominator of the prefixed fraction must always be equal to the number of specific cleavages comprising the multicutter.

Polynucleotide Cleavages

A dinucleotide cleavage, such as cutting 3′ to A at every AC, is represented as _(16/1)[A.C], with the “16” being the total number of possible dinucleotides. For ordinary nucleic acids, the numerator of the prefixed fraction will be 4^(L), where L is the length of the sequence being cleaved. Cutting 5′ to the trinucleotide TTA would therefore be written as _(64/1)[.TTA]. It is always possible to represent a specific cleavage of a given length as a multicutter of a longer length. For example, _(4/1)[G.] is equivalent to _(16/4)[G.N], _(64/16)[NG.N], and _(64/16)[G.NN].

A multicutter may be composed of cleavage reactions that are specific to different-length sequences in the target. In these cases, the shorter-length cleavages are written as combined cleavages at the length of the longest cleavage in the group. For example, a grouped multicutter that cleaves at [.A] and [T.G] would be written as _(16/5)[N.A T.G]. When using the prefixed fraction notation, any cleavage that cuts multiple times in its recognition sequence should be expressed as a multicutter of a longer length that cuts once at the same position within the recognition sequences. For example, [.A.] would be rewritten as _(16/7)[A.N N.A].

Representations that indicate polynucleotide sequences that are not cleaved follow the same pattern as described for mononucleotides. For example, cleavage 3′ to all dinucleotides except CT and AG would be written as _(16/14)[inv(CT. AG.)] and is equivalent to _(16/14)[AH. CV. KN.]. Note that in this notation, _(16/14)[inv(A.G G.A)] is not the same as _(16/12)[inv(R.R)]. This latter multicutter is equivalent to _(16/12)[inv(A.A A.G G.A G.G)].
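The equivalences and group complexities quoted above can be checked by expanding the degenerate codes into concrete dinucleotides. The sketch below uses the standard IUPAC degeneracy codes and tracks only which dinucleotides are recognized, not where within them the cut falls; `expand` and `expand_cutter` are hypothetical helper names.

```python
from itertools import product

# Standard IUPAC nucleotide degeneracy codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand(term):
    """Expand one term such as 'A.H' or 'AH.' into its concrete dinucleotides.

    The period (cut position) is ignored here; only the recognized
    sequences are returned.
    """
    codes = term.replace(".", "")
    return {"".join(p) for p in product(*(IUPAC[c] for c in codes))}

def expand_cutter(terms, invert=False):
    """Union of all terms; with invert=True, every dinucleotide NOT listed."""
    cut = set().union(*(expand(t) for t in terms))
    if invert:
        cut = expand("NN") - cut
    return cut

inv_ct_ag = expand_cutter(["CT.", "AG."], invert=True)
print(len(inv_ct_ag), inv_ct_ag == expand_cutter(["AH.", "CV.", "KN."]))
print(len(expand_cutter(["A.N", "N.A"])))   # group complexity of [.A.]
print(len(expand_cutter(["N.A", "T.G"])))   # group complexity of [N.A T.G]
```

The set sizes correspond to the denominators of the prefixed fractions, so a mismatch between a written denominator and the expanded set indicates a notational error.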

Fragments produced by cleavage with a given multicutter will have bases at their 5′ and 3′ termini that “conform” to the sequence specificity of that multicutter. For example, the multicutter [A.G] produces fragments which have a 5′ terminal G and a 3′ terminal A. Fragments produced by a multicutter [inv(A.)] would have 5′ terminal A and 3′ terminal B. Multicutters with longer sequence specificities also follow this pattern. For example, [GT.A.C.V] produces fragments which have 5′ terminal V and the dinucleotide sequence GT at their 3′ end. In the present specification, the term “fragment” denotes an oligonucleotide produced by cleavage with a multicutter which has both 3′ and 5′ conforming termini.

Generalized Nucleotides: Permutations

The total number of different grouped multicutters, T_(GMC), for any cleavage sequence length L is given by:

T_(GMC)=2^(4^(L))−1

For mononucleotides, there are 15 multicutters, for dinucleotides there are 65535, and for trinucleotides there are ˜1.84×10¹⁹. It is not practical to discuss each possible dinucleotide or trinucleotide multicutter, and so we use the concept of Cleavage Family Equivalents (“cf-equivalents”). For example, consider _(16/1)[A.A]. This multicutter is part of the family that includes the other “repeated” dinucleotide cleavages: _(16/1)[C.C], _(16/1)[G.G], and _(16/1)[T.T]. Similarly, _(16/1)[A.C] is a member of the family that includes the eleven other dinucleotides composed of two different bases. The members of a given multicutter family possess the same statistical properties for cleavage of random sequence even though their base specificities are different. The remainder of this discussion will focus primarily on dinucleotide multicutters, but the generalizations described are valid for multicutters of all lengths.
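The counts quoted above follow from treating a multicutter as any non-empty subset of the 4^(L) possible cleavage sequences. A short check, with an illustrative function name:

```python
def total_multicutters(length):
    """Number of non-empty subsets of the 4**length possible cleavage sequences."""
    return 2 ** (4 ** length) - 1

for L in (1, 2, 3):
    print(L, total_multicutters(L))
```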

We will formalize the concept of cf-equivalents as follows: the symbols α β γ δ will be used to denote a generalized set of four different nucleotides. For cases where fewer than four nucleotides are used, α is always the first nucleotide, β is always the second, and γ is always the third. Thus the sequences AGGAG, TCCTC, and ATTAT, each taken in isolation, would be written as αββαβ, since they are each composed of only two nucleotides.

In this specification we have implicitly assumed that there are always four nucleotides with the following assignments: α=A, β=C, γ=G, and δ=T. There are twenty-three other possible assignments for a set of four generic nucleotides, as shown in Table 2.

Consider the multicutter _(16/4)[A.C C.A G.T T.G]. In order to find the cf-equivalents for this multicutter we first express it in terms of generic nucleotides to yield: _(16/4)[α.β β.α γ.δ δ.γ]

then substitute the specific assignments for each of the 24 permutations, and discard the duplicates. This yields _(16/4)[A.G C.T G.A T.C] and _(16/4)[A.T C.G G.C T.A] as the two other members of the family. This procedure is used to find the members of all generic multicutter families.

Certain cleavage families, such as _(16/4)[α.α β.β γ.γ δ.δ], have only one member: _(16/4)[A.A C.C G.G T.T]. Other families, such as the generic multicutter _(16/2)[α.β β.γ], have the maximum possible twenty-four members. We will use the notation _(16/2)[α.β β.γ]:24 to indicate the number of distinct multicutters in a given family. When grouped into cf-equivalents, the 65535 possible dinucleotide multicutters represent only 3043 families, including the trivial multicutter _(16/16)[N.N]:1 (or _(4/4)[N.]:1).
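The substitute-and-deduplicate procedure can be sketched as follows. The representation is an assumption made for illustration: a dinucleotide multicutter is a set of two-letter strings, and a family is the set of images of that multicutter under all 24 relabellings of the four bases. The function names are hypothetical. The second function counts families over all 65535 dinucleotide multicutters by marking each orbit once.

```python
from itertools import permutations

BASES = "ACGT"
DINUCS = [a + b for a in BASES for b in BASES]

def family(cutter):
    """All distinct multicutters obtained by relabelling the four bases."""
    members = set()
    for perm in permutations(BASES):
        relabel = dict(zip(BASES, perm))
        members.add(frozenset(relabel[d[0]] + relabel[d[1]] for d in cutter))
    return members

def count_families():
    """Count relabelling classes over all non-empty dinucleotide multicutters."""
    index = {d: i for i, d in enumerate(DINUCS)}
    # For each permutation, the destination bit of each of the 16 dinucleotides.
    bit_maps = []
    for perm in permutations(BASES):
        relabel = dict(zip(BASES, perm))
        bit_maps.append([index[relabel[d[0]] + relabel[d[1]]] for d in DINUCS])
    seen = bytearray(1 << 16)
    families = 0
    for mask in range(1, 1 << 16):
        if seen[mask]:
            continue
        families += 1                      # first time this orbit is reached
        for bm in bit_maps:                # mark every relabelled image
            image = 0
            for bit in range(16):
                if mask >> bit & 1:
                    image |= 1 << bm[bit]
            seen[image] = 1
    return families

print(len(family({"AC", "CA", "GT", "TG"})))   # family size for [A.C C.A G.T T.G]
print(len(family({"AC", "CG"})))               # family size for [A.C C.G]
print(count_families())
```

Marking whole orbits keeps the enumeration fast, since the 24 images are only computed for one representative per family.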

For the remainder of this specification, all sequences or cleavages written using the specific nucleotides A C G T are also considered to be generic representations using α β γ δ. This also applies to sequences or cleavages written using the standard abbreviations for nucleotide degeneracies. We use the symbol η to represent any of the four generic nucleotides α β γ δ (analogous to N for ordinary nucleotides). Thus _(16/2)[A.C C.A] represents all the multicutters in the family _(16/2)[α.β β.α]:6 and _(16/4)[A.M T.K] represents all the multicutters in the family _(16/4)[α.α α.β δ.γ δ.δ]:12. This notation is interchangeable with _(16/4)[A.M T.K]:12 and is considered to be equivalent to it.

Analysis of Previous Non-Overlapping Fragmentation Methods

In order to show why previously described fragmentation techniques are not capable of de novo sequencing, we have analyzed the properties of three different cleavage families representative of these methods. Data was obtained by simulating cleavage of a single target composed of approximately 10⁸ bases of random sequence. For each multicutter family, the following statistics were calculated:

(i) Fragment lengths L, in nucleotides. In general we have only shown data for fragments of 24 nucleotides or smaller, since these are the fragments most useful for MALDI-TOF mass spectrometry.

(ii) The total number of different fragments possible at every length L. By definition, two fragments are different if and only if they have differing base sequences. This value reflects the complexity of the fragment mixture at length L.

(iii) The total number of different base compositions represented by the possible fragments at every length L. Two different fragments possessing the same base composition by definition must have the same molecular weight. Two fragments with differing base compositions may also have the same molecular weight; this depends on the masses of the specific nucleotides present. The number of base compositions represents the upper bound on the number of distinct masses that the fragments of length L can possess.

(iv) The average number of fragments at every length L that are expected to appear per 1000 bases of a target composed of random sequence. This statistic provides a measure of how much useful information can be obtained from fragments at any given length as the size of target increases. We use the term detectable fragments to denote the average total number of expected fragments that are longer than three nucleotides.

(v) The average distance, along the intact target, between fragments of length L or longer, in bases. This statistic provides a measure of how sparsely fragments larger than any given size are distributed along the target. We use the term interfragment interval to denote the average distance between fragments that are longer than three nucleotides.

(vi) The fraction of the target bases, in percent, that are covered by fragments of length L.

(vii) The cumulative fraction of the target bases, in percent, that are covered by fragments of length L or greater. This is a measure of how much of the target is sampled by fragments longer than any given length.

(viii) The fraction of the total number of fragments, in percent, that are of length L.

Single-Base Cleavage

The first multicutter family we examine, _(4/1)[A.] or _(16/4)[A.N], is known as single-base cleavage (Zabeau et al. 2000; Shchepinov et al. 2001; Rodi et al. 2002) and produces an average fragment length of 4.00 bases. The cf-equivalent representations for this family are _(4/1)[α.]:4 and _(16/4)[α.η]:4. Statistics for this multicutter are shown in Table 3a.

The usable mass range for MALDI analysis of nucleic acids is approximately 1100 Da to 10 kDa, which corresponds to fragments 4 to 30 bases in length (Stanssens et al. 2004). This cleavage family therefore destroys ˜26% of the target in the production of mono-, di-, and trinucleotides. For targets with random base sequence, we would expect approximately 105 detectable fragments per kilobase, with an interfragment interval of 2.48 bases.
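For mononucleotide multicutters, the destroyed fraction and the detectable-fragment count can be approximated in closed form rather than by simulation. The sketch below assumes independent, uniformly distributed bases, so that fragment lengths follow a geometric distribution; this is an assumption made for the illustration, and the function name is hypothetical.

```python
def mononucleotide_stats(cut_fraction, min_detectable=4):
    """Expected statistics for a mononucleotide multicutter on random sequence.

    cut_fraction: fraction of bases after which a cut occurs,
                  e.g. 1/4 for [A.] and 3/4 for [B.].
    Returns (mean fragment length, fraction of target destroyed,
             detectable fragments per kilobase).
    """
    p = cut_fraction
    mean_length = 1 / p

    def length_prob(L):
        # Probability that a fragment has length exactly L (geometric).
        return p * (1 - p) ** (L - 1)

    # Fraction of target bases lying in fragments shorter than min_detectable.
    destroyed = sum(L * length_prob(L) for L in range(1, min_detectable)) / mean_length
    # Fragments per kilobase times the probability of reaching min_detectable.
    detectable_per_kb = 1000 * p * (1 - p) ** (min_detectable - 1)
    return mean_length, destroyed, detectable_per_kb

print(mononucleotide_stats(1 / 4))   # single-base cleavage [A.]
print(mononucleotide_stats(3 / 4))   # [B.], cutting 3' to every C, G, or T
```

Under this model the single-base case gives a destroyed fraction of about 26% and about 105 detectable fragments per kilobase, in line with the figures above. The model does not extend directly to dinucleotide multicutters, where adjacent cut positions are not independent.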

Single-base cleavage can generate 3^((L−1)) possible fragments at every L, but only L(L+1)/2 possible compositions (and possible fragment masses). This is the reason that this type of cleavage cannot be used for de novo sequencing: there are simply too many different fragments that have the same molecular weight. The possible fragments that can be produced at length L are given by: 5′-(B)_((L-1))A-3′. Table 3b below shows all possible fragments that can be generated for L=5. Capitalized bases represent the actual fragments, lowercase bases indicate the context of (the bases adjacent to) the fragment in the intact target, and periods indicate where cleavage occurs. All sequences are written in the 5′ to 3′ direction.

The fundamental limitation of single-base cleavage is that it totally destroys regions of the target with a high occurrence of the cleaved nucleotide, such as homopolymeric and low-complexity regions. Nearly 58% of the fragments produced provide no sequence information at all, and fully 25% of the fragments are mononucleotides.

In a method directed to detection of single nucleotide repeats, a single nucleotide cutter is preferably used. For example, single nucleotide cutters are useful in a method to scan large templates, such as a complete or partial chromosome, to identify regions of interest. Such regions of interest include, but are not limited to, poly-A regions. Identification of these regions allows estimating the number of genes in the chromosome or part of the chromosome by using identification of poly-A tails. Analysis of the number of fragments consisting of multiple consecutive A-nucleotides can be performed, for example, by calculating the surface area of the mass spectrometer peak and comparing it to the size of a peak from one single such poly-A-repeat. Naturally, the number of A nucleotides in different poly-A-tails varies, and the total number of genes is a sum of the number of all the different poly-A fragments as determined by their peak sizes in the mass spectra.

Relaxed Dinucleotide Cleavage

Zabeau et al. (2000) describe a variation of single-base cleavage that preserves homopolymeric regions of the cleaved nucleotide. This multicutter is _(16/3)[A.B] and is part of the _(16/3)[α.β α.γ α.δ]:4 family. It produces fragments with an average length of 5.33 bases. Statistics are shown in Table 4a.

This multicutter cannot produce mononucleotides; the number of possible fragments at any length L is approximately 1.5 times as great as that for single-base cleavage. The number of possible compositions is given by (L(L+1)(L+2)/6−1). In other respects it is very similar to single-base cleavage. The possible fragments that can be produced at length L>1 are given by: 5′-(B)_(i)(A)_(k)-3′, where (i+k)=L, 0<i<L, and 0<k<L. This multicutter is expected to produce 117 detectable fragments per kilobase of target with an interfragment interval of 1.50 bases. The slight increase in average fragment length and total target coverage comes at the cost of greatly increased fragment complexity for any given L. All possible fragments for L=5 are shown in Table 4b below.

Dinucleotide Cleavage

Stanton Jr. et al. (2003, U.S. Pat. No. 6,566,059) describe a method for cleaving at a specific dinucleotide composed of two different bases, _(16/1)[A.C]. This cleavage produces an average fragment length of 16.00 bases and is part of the _(16/1)[α.β]:12 family. Statistics are shown in Table 5a.

Dinucleotide cleavage is far superior to single-base cleavage in terms of target coverage by fragments that are longer than 3 bases. Only about 2% of the target is destroyed by this cleavage and only 12.5% of the fragments produced are di- or trinucleotides. The number of possible fragments scales roughly in proportion to (3.73)^(L) while the number of possible compositions is given by L(L²−1)/6. This multicutter is expected to produce 54.7 detectable fragments per kilobase of target with an interfragment interval of only 0.36 bases. The fragments produced by this cleavage at any given L are a subset of 5′-C(N)_((L-2))A-3′. All possible fragments for L=5 are shown in Table 5b.

An interesting property of dinucleotide cleavage is that for fragments shorter than seven bases, there are fewer possible fragments than can be generated by single-base cleavage. This is due to the fact that both the 5′ and 3′ terminal bases of all fragments are fixed for this cleavage.
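The crossover at seven bases can be reproduced by brute-force enumeration. In this sketch a multicutter is represented, as an assumption for illustration, by a set of two-letter strings "XY" meaning "cut between X and Y", and a fragment is any uncleaved sequence whose termini could have been produced by one of those cuts; `count_fragments` is a hypothetical name.

```python
from itertools import product

BASES = "ACGT"

def count_fragments(cutter, length):
    """Count uncleaved sequences of a given length with conforming termini."""
    five_prime_ok = {d[1] for d in cutter}    # bases that can follow a cut
    three_prime_ok = {d[0] for d in cutter}   # bases that can precede a cut
    total = 0
    for seq in map("".join, product(BASES, repeat=length)):
        if seq[0] not in five_prime_ok or seq[-1] not in three_prime_ok:
            continue
        if any(seq[i:i + 2] in cutter for i in range(length - 1)):
            continue
        total += 1
    return total

single_base = {"A" + n for n in BASES}   # [A.] written as [A.N]
dinucleotide = {"AC"}                    # [A.C]

for L in range(2, 9):
    print(L, count_fragments(single_base, L), count_fragments(dinucleotide, L))
```

The single-base column reproduces 3^((L−1)), while the dinucleotide column starts lower and overtakes it at L=7.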

Singly-Tagged Homopolymeric Subsequences

A related multicutter, _(4/3)[B.] or _(16/12)[B.N] (also written as _(4/3)[inv(A.)]), extracts singly-tagged homopolymeric subsequences from the target (the homopolymeric region plus one additional base). This multicutter is part of the _(4/3)[inv(α.)]:4 family and produces an average fragment length of 1.333 bases. Statistics are shown in Table 8a.

This multicutter destroys ˜95% of the target, and can produce exactly three fragments at each length L of the form 5′-(A)_((L-1))B-3′ for L>1. This cleavage is expected to produce 11.7 detectable fragments per kilobase of target with an interfragment interval of 81 bases. Fragments for L four through eight are shown in Table 8b.

The molecular weight of the possible fragments at every length L is given by M_(term)+(L−1)(M_(A))+M_(last), where M_(last) equals the mass of the 3′ terminal base (M_(C), M_(G), or M_(T)) of the fragment. In order for this multicutter to produce a fragment identity mapping, each possible fragment must have a unique molecular weight. This is true as long as the masses of the nucleotides C, G, and T (or β, γ, and δ) are different. Stated formally, M_(C)≠M_(G), M_(C)≠M_(T), and M_(G)≠M_(T) (or M_(β)≠M_(γ), M_(β)≠M_(δ), and M_(γ)≠M_(δ)). Note that any one of the terminating nucleotides C, G, or T can have the same molecular weight as A and the fragment identity mapping will still hold.

Multiply-Tagged Homopolymeric Subsequences

Strict fragment identity mappings utilizing multicutters that cleave once into dinucleotide sequences can produce a maximum of eight different fragments at each length L>2. In general these multicutters extract multiply-tagged homopolymeric subsequences from the target (the homopolymeric region plus up to three additional surrounding bases). One example of this type of multicutter is _(16/9)[C.M V.K T.T], which is part of the _(16/9)[α.γ β.η γ.γ η.δ]:24 family and produces an average fragment length of 1.78 bases. Statistics are shown in Table 9a.

This multicutter destroys ˜90% of the target, and produces exactly eight fragments at each length L that are a subset of 5′-DR(A)_((L-3))M-3′ for L>2. This cleavage is expected to produce 23.5 detectable fragments per kilobase of target with an interfragment interval of 38.3 bases. Fragments for L four through eight are shown in Table 9b.

This multicutter produces a fragment identity mapping if the masses of the nucleotides A, C, G, and T are all different from each other.

Relaxed Fragment Identity Mappings

Dinucleotide Repeats

An example of a relaxed fragment identity mapping is found in the PSBF reaction utilizing the multicutter _(16/14)[inv(A.C C.A)] (also written _(16/14)[A.D C.B K.N]) which is part of the _(16/14)[inv(α.β β.α)]:6 family. This multicutter extracts dinucleotide repeats from the target and generates an average fragment length of 1.143 bases. Statistics are shown in Table 10a.

This cleavage destroys ˜97% of the target, and can produce exactly two fragments at each length L for L>1. Fragments produced are of the form

5′-(AC)_((L/2))-3′ and 5′-(CA)_((L/2))-3′, for L even,

5′-C(AC)_(((L-1)/2))-3′ and 5′-A(CA)_(((L-1)/2))-3′, for L odd.

This multicutter is expected to produce 5.85 detectable fragments per kilobase of target with an interfragment interval of 166 bases. Fragments for L four through eight are shown in Table 10b.

This multicutter extracts from the target both reading frames of repeats of the dinucleotide AC (or αβ). The masses of the fragments at each L are given by:

M_(frag)=M_(term)+(L/2)(M_(A)+M_(C)), for L even

M_(frag)=M_(term)+((L−1)/2)(M_(A)+M_(C))+M_(odd), for L odd,

where M_(odd) equals either M_(A) or M_(C).

If the nucleotides A and C (α and β) have different masses (M_(A)≠M_(C) or M_(α)≠M_(β)) this multicutter establishes a fragment identity mapping for all odd fragment lengths L.
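The even/odd behaviour can be illustrated numerically. The sketch below uses the average masses of natural A and C quoted in this specification, stored in tenths of a dalton for exact comparison, and treats M_(term) as a placeholder of zero; the helper names are illustrative.

```python
# Average nucleotide masses from the text, in tenths of a dalton (integers).
MASS = {"C": 2892, "T": 3042, "A": 3132, "G": 3292}

def repeat_fragments(unit, length):
    """All reading frames of a tandem repeat of `unit`, cut to `length` bases."""
    track = unit * (length // len(unit) + 2)
    return sorted({track[offset:offset + length] for offset in range(len(unit))})

def fragment_masses(unit, length, m_term=0):
    """Map each reading-frame fragment to its mass (m_term is a placeholder)."""
    return {f: m_term + sum(MASS[b] for b in f)
            for f in repeat_fragments(unit, length)}

for L in range(4, 9):
    masses = fragment_masses("AC", L)
    distinct = len(set(masses.values())) == len(masses)
    print(L, masses, "unique" if distinct else "ambiguous")
```

At even lengths the two reading frames have the same base composition and hence the same mass, while at odd lengths they differ by the substitution of one A for one C.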

Trinucleotide Repeats

An example of a multicutter that extracts trinucleotide repeats from the target is _(16/13)[inv(A.C C.G G.A)] (also written _(16/13)[A.D C.H G.B T.N]) which is part of the _(16/13)[inv(α.β β.γ γ.α)]:8 family. This multicutter generates an average fragment length of 1.231 bases. Statistics are shown in Table 11a.

This cleavage destroys ˜96% of the target, and produces exactly three fragments at each length L for L>1. This multicutter is expected to produce 8.78 detectable fragments per kilobase of target with an interfragment interval of 109.6 bases. Fragments for L four through eight are shown in Table 11b.

This multicutter extracts from the target all three reading frames of repeats of the trinucleotide ACG (or αβγ). The masses of the fragments at each L are given by:

M_(frag)=M_(term)+(L/3)(M_(ACG)), for L=3, 6, 9, 12, . . .

M_(frag)=M_(term)+((L−1)/3)(M_(ACG))+M_(X), for L=4, 7, 10, 13, . . .

M_(frag)=M_(term)+((L+1)/3)(M_(ACG))−M_(X), for L=5, 8, 11, 14, . . .

where M_(ACG)=(M_(A)+M_(C)+M_(G)) and where M_(X) equals one of M_(A), M_(C), or M_(G).

If the nucleotides A, C, and G (α, β, and γ) all have different masses (M_(A)≠M_(C), M_(C)≠M_(G), and M_(G)≠M_(A)) this multicutter establishes a fragment identity mapping for fragment lengths L=4, 5, 7, 8, 10, 11, . . . .

Tetranucleotide Repeats

An example of a multicutter that extracts tetranucleotide repeats from the target is _(16/12)[inv(A.C C.G G.T T.A)] (also written _(16/12)[A.D C.H G.V T.B]) which is part of the _(16/12)[inv(α.β β.γ γ.δ δ.α)]:6 family. This multicutter generates an average fragment length of 1.333 bases. Statistics are shown in Table 12a.

This cleavage destroys ˜95% of the target, and can produce exactly four fragments at each length L. This multicutter is expected to produce 11.7 detectable fragments per kilobase of target with an interfragment interval of 81 bases. Fragments for L four through eight are shown in Table 12b.

This multicutter extracts from the target all four reading frames of repeats of the tetranucleotide ACGT (or αβγδ). The masses of the fragments at each L>3 are given by:

M_(frag)=M_(term)+(L/4)(M_(ACGT)), for L=4, 8, 12, 16, . . .

M_(frag)=M_(term)+((L−1)/4)(M_(ACGT))+M_(X), for L=5, 9, 13, 17, . . .

M_(frag)=M_(term)+((L−2)/4)(M_(ACGT))+M_(Z), for L=6, 10, 14, 18, . . .

M_(frag)=M_(term)+((L+1)/4)(M_(ACGT))−M_(X), for L=7, 11, 15, 19, . . .

where M_(ACGT)=(M_(A)+M_(C)+M_(G)+M_(T)), where M_(X) equals one of M_(A), M_(C), M_(G), or M_(T), and where M_(Z) equals one of (M_(A)+M_(C)), (M_(C)+M_(G)), (M_(G)+M_(T)), or (M_(T)+M_(A)).

If the nucleotides A, C, G, and T all have different masses this multicutter establishes a fragment identity mapping for fragment lengths L=5, 6, 7, 9, 10, 11, 13, 14, 15, . . . .
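The fragment lengths at which a tandem-repeat mapping holds can be found by checking, at each length, whether every reading frame has a distinct base composition. The sketch below assumes that distinct compositions have distinct masses (condition (II)), so it tests compositions only; the function names are illustrative.

```python
def reading_frames(unit, length):
    """All reading frames of a tandem repeat of `unit`, cut to `length` bases."""
    track = unit * (length // len(unit) + 2)
    return sorted({track[offset:offset + length] for offset in range(len(unit))})

def mapping_lengths(unit, max_length):
    """Lengths (from one repeat unit up to max_length) at which every
    reading frame has its own base composition."""
    good = []
    for L in range(len(unit), max_length + 1):
        frames = reading_frames(unit, L)
        compositions = {"".join(sorted(f)) for f in frames}
        if len(compositions) == len(frames):
            good.append(L)
    return good

print(mapping_lengths("AC", 9))      # dinucleotide repeat
print(mapping_lengths("ACG", 11))    # trinucleotide repeat
print(mapping_lengths("ACGT", 15))   # tetranucleotide repeat
```

In each case the mapping fails exactly at whole multiples of the repeat length, where all reading frames contain the same bases in rotated order.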

Tagged Dinucleotide Repeats

All of the relaxed fragment identity mappings presented thus far produce a constant number of possible fragments at every L>2 but a varying number of possible compositions. An example of a multicutter that produces a varying number of fragments at each L but a constant number of compositions is _(16/11)[inv(A.T K.M)]:24 (also written _(16/11)[M.V B.K]:24). This multicutter extracts tagged dinucleotide repeats from the target (the repeat region plus up to two surrounding bases) and generates an average fragment length of 1.455 bases. Statistics are shown in Table 13a.

This cleavage destroys ~93% of the target, and produces exactly four fragments for L odd and exactly five fragments for L even. This multicutter is expected to produce 15.6 detectable fragments per kilobase of target with an interfragment interval of 59.8 bases. Fragments for L four through eight are shown in Table 13b.

This multicutter extracts from the target both reading frames of repeats of the dinucleotide AT (or αδ) along with one or two additional nucleotides C or G (β or γ). The masses of the fragments at each L>3 are given by:

M_(frag) = M_(term) + ((L−1)/2)(M_(AT)) + M_(X), for L odd, and

M_(frag) = M_(term) + ((L/2)−1)(M_(AT)) + M_(Z), for L even,

where M_(AT) = (M_(A)+M_(T)), where M_(X) equals one of M_(A), M_(C), M_(G), or M_(T), and where M_(Z) equals one of (M_(A)+M_(T)), (M_(G)+M_(A)), (M_(G)+M_(C)), or (M_(T)+M_(C)).

If the nucleotides A, C, G, and T all have different masses, this multicutter establishes a fragment identity mapping for all odd fragment lengths L.

Limited Fragment Identity Mappings

All limited fragment identity mappings extract monotonic subsequences from the target. We define a monotonic fragment of length L as having a base sequence of the form

5′-(α)_(v)(β)_(w)(γ)_(x)(δ)_(z)-3′,

where (v+w+x+z)=L, where 0≦v≦L, 0≦w≦L, 0≦x≦L, and 0≦z≦L.

By inspection, each different monotonic fragment of length L has a unique base composition. The mass of any monotonic fragment is given by:

M_(frag) = M_(term) + vM_(α) + wM_(β) + xM_(γ) + zM_(δ).

Limited fragment identity mappings hold only across a certain pre-defined mass range. The lower bound of this range is the mass of the smallest detectable fragment, which is about 1100 Da in a MALDI instrument. In general, the approximate upper bound of the mass range can be determined by finding the lowest mass at which any two different fragments are within one Da of each other. Above this upper bound, the mapping is relaxed, and certain masses, known beforehand, will correspond to two or more different fragments.
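The search for the upper bound described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used to produce the tables: the function names are hypothetical, the enumeration is truncated at a caller-supplied maximum length, and "within one Da" is read as a mass difference strictly smaller than the tolerance. The nucleotide and terminal masses in the usage section are approximate average values for unmodified DNA residues, assumed here for illustration; they are not the values of Table 17.

```python
from itertools import product


def monotonic_masses(masses, m_term, max_len, min_len=1):
    """Return sorted (mass, counts) pairs for all monotonic compositions.

    `masses` holds one mass per nucleotide; `counts` holds the matching
    exponents (v, w, x, z). Each monotonic fragment has a unique composition.
    """
    out = []
    for counts in product(range(max_len + 1), repeat=len(masses)):
        if min_len <= sum(counts) <= max_len:
            mass = m_term + sum(c * m for c, m in zip(counts, masses))
            out.append((mass, counts))
    return sorted(out)


def upper_bound(masses, m_term, max_len, tolerance=1.0, min_mass=0.0):
    """Lowest mass at which two different fragments are closer than `tolerance`.

    Only fragments at or above `min_mass` (the detection limit) are considered.
    Returns None if no such pair exists up to `max_len`.
    """
    frags = [f for f in monotonic_masses(masses, m_term, max_len) if f[0] >= min_mass]
    # In a sorted list, the closest partner of any mass is an adjacent entry.
    for (m1, _), (m2, _) in zip(frags, frags[1:]):
        if m2 - m1 < tolerance:
            return m1
    return None


if __name__ == "__main__":
    # Assumed approximate average residue masses (Da) for dA and dC, and an
    # assumed terminal mass; replace with the masses of the nucleotides used.
    print(upper_bound((313.21, 289.18), m_term=18.02, max_len=30, min_mass=1100.0))
```

The maximum length must be chosen large enough that the first collision falls inside the enumerated range; otherwise the function returns None and the search should be repeated with a larger value.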

Monotonic Subsequences Composed of Two Different Nucleotides

An example of a multicutter that extracts the simplest type of monotonic sequences, those composed of only two different nucleotides, is _(16/13)[inv(A.A A.C C.C)] (also written _(16/13)[C.A M.K K.N]) which is part of the _(16/13)[inv(α.α α.β β.β)]:12 family. This multicutter generates an average fragment length of 1.231 bases. Statistics are shown in Table 14a.

This cleavage destroys ~94% of the target, and can produce exactly (L+1) fragments at each length L for L>1. This multicutter is expected to produce 13.7 detectable fragments per kilobase of target with an interfragment interval of 68.8 bases. Fragments produced are of the form

5′-(A)_(i)(C)_(k)-3′,

where (i+k)=L, 0≦i≦L, and 0≦k≦L

Fragments for L four through eight are shown in Table 14b.

Fragment masses are given by: M_(frag) = M_(term) + iM_(A) + kM_(C), where (i+k)=L, 0≦i≦L, and 0≦k≦L.

If the nucleotides A and C (α and β) have different masses (M_(A)≠M_(C) or M_(α)≠M_(β)), this multicutter establishes a limited fragment identity mapping.

Monotonic Subsequences Composed of Three Different Nucleotides

An example of a multicutter that extracts monotonic sequences composed of three different nucleotides is _(16/9)[B.V] (also written _(16/9)[inv(A.N N.T)]) which is part of the _(16/9)[inv(α.η η.β)]:12 family. This multicutter generates an average fragment length of 1.778 bases. Statistics are shown in Table 15a.

This cleavage destroys ~84% of the target, and produces exactly (3L−1) fragments at each length L. This multicutter is expected to produce 35.2 detectable fragments per kilobase of target with an interfragment interval of 24 bases. Fragments produced are of the form

5′-(A)_(i)(C)_(w)(G)_(x)(T)_(k)-3′,

where (i+k+w+x)=L, (w+x)≦1, 0≦i≦L, 0≦k≦L, 0≦w≦1, 0≦x≦1.

Fragments for L four through eight are shown in Table 15b.

Fragment masses are given by

M_(frag) = M_(term) + iM_(A) + kM_(T) + wM_(C) + xM_(G),

where (i+k+w+x)=L, (w+x)≦1, and 0≦i≦L, 0≦k≦L, 0≦w≦1, 0≦x≦1.

If all nucleotides A, C, G, and T have different masses, this multicutter establishes a limited fragment identity mapping.

Monotonic Sequences Composed of Four Nucleotides

An example of a multicutter that extracts monotonic sequences composed of all four different nucleotides is _(16/6)[C.A G.M T.V] which is part of the _(16/6)[β.α γ.α γ.β δ.γ]:24 family. This multicutter generates an average fragment length of 2.667 bases. Statistics are shown in Table 16a.

This cleavage destroys only ~62% of the target, and can produce exactly

((L+1)(L+2)(L+3)/6−2)

fragments at each length L (the two “missing” fragments are 5′-(A)_(L)-3′ and 5′-(T)_(L)-3′). This multicutter is expected to produce 82 detectable fragments per kilobase of target with an interfragment interval of 7.52 bases. Fragments produced are of the form

5′-(A)_(v)(C)_(w)(G)_(x)(T)_(z)-3′,

where (v+w+x+z)=L, where 0≦v≦L, 0≦w≦L, 0≦x≦L, and 0≦z≦L.

The mass of any fragment is given by

M_(frag) = M_(term) + vM_(A) + wM_(C) + xM_(G) + zM_(T).

Fragments for L four through six are shown in Table 16b.

If all nucleotides A, C, G, and T have different masses, this multicutter establishes a limited fragment identity mapping.
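The fragment count ((L+1)(L+2)(L+3)/6−2) can be confirmed by brute-force enumeration. The sketch below is illustrative and uses hypothetical function names. It lists every monotonic fragment 5′-(A)_(v)(C)_(w)(G)_(x)(T)_(z)-3′ of a given length, removes the two homopolymers 5′-(A)_(L)-3′ and 5′-(T)_(L)-3′ that this multicutter cannot release, and compares the result with the closed-form expression.

```python
from itertools import product


def monotonic_fragments(length, alphabet="ACGT"):
    """Enumerate monotonic fragments (A)v(C)w(G)x(T)z with v+w+x+z == length."""
    frags = []
    for counts in product(range(length + 1), repeat=len(alphabet)):
        if sum(counts) == length:
            frags.append("".join(base * n for base, n in zip(alphabet, counts)))
    return frags


def extractable_fragments(length):
    """Drop the two homopolymers, (A)L and (T)L, that cannot be produced."""
    excluded = {"A" * length, "T" * length}
    return [f for f in monotonic_fragments(length) if f not in excluded]


def expected_count(length):
    """Closed-form count given in the text."""
    return (length + 1) * (length + 2) * (length + 3) // 6 - 2


if __name__ == "__main__":
    for length in range(1, 9):
        print(length, len(extractable_fragments(length)), expected_count(length))
```

The two homopolymers are excluded because no cut in this multicutter has A on its 5′ side or T on its 3′ side, so a fragment can neither end in A nor begin with T.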

There are at least three key differences between partial sequencing by fragmentation (PSBF) and existing non-overlapping fragmentation (NOF) methods:

1) PSBF provides information about specific subsequences present in the target, while NOF methods give molecular weights or at best base compositions of fragments. PSBF provides useful information even in cases where it cannot assign a unique sequence to an observed fragment mass.

2) Mass spectra of PSBF cleavage reaction products can be unambiguously interpreted without knowing the sequence of the target or a reference. All existing NOF sequencing methods are contingent on knowing a reference sequence so that possible fragment masses can be calculated beforehand.

3) PSBF generates far fewer detectable fragments for the same target length than do NOF methods, and interfragment intervals are typically ten to a hundred times larger.

In general, the PSBF method of the present invention is amenable to all situations where NOF methods are currently utilized. The PSBF method of the present invention is especially useful for fingerprinting long targets, as it generates a low number of detectable fragments. PSBF may also be combined with techniques for peak quantitation to determine the relative copy number of a given subsequence (Buetow et al. 2001; Bansal et al. 2002; Mohike et al. 2002). Specific advantages of particular non-limiting example applications are discussed below.

Rapid Bacterial and Viral Identification

NOF methods have been used for genotypic identification and classification of both known and unknown bacterial samples (von Wintzingerode et al. 2002, Lefmann et al. 2004). These methods are limited to analysis of short signature regions (<2 kb) that have been PCR amplified from the target bacteria.

Therefore, one embodiment of the present invention provides PSBF as a highly effective method for genotypic identification and classification of both known and unknown bacterial samples. The present method allows sampling of larger signature regions (at least in the range of 5-100 kb). If highly destructive multicutters are used (those that destroy >98% of the target), then entire bacterial or viral genomes can be sampled in a single reaction. Since reference target sequences are not required for PSBF, totally uncharacterized targets could be analyzed and compared to each other and known samples, which is not currently possible using NOF methods.

Discovery and Scoring of Tandem Repeat Regions

PSBF is also a useful method to rapidly score or discover tandem repeats in both de novo and diagnostic settings. A primary advantage over NOF methods in this application is that PSBF can extract all the repeated regions present in the target at once, even if the sequences of the surrounding regions are not known.

SNP Discovery and Detection

PSBF is also useful for SNP detection or discovery in cases where the SNP of interest occurs within a subsequence of the target that is detected by the PSBF reaction. PSBF in general will sample a smaller fraction of the target per fragmentation reaction when compared to NOF methods for this purpose. However, since reference sequences are not required, PSBF can be used to discover sequence variations in pools of related sequences that have not been completely characterized.

EXAMPLES

Virtually all existing fragmentation methods utilize complete chemical or enzymatic cleavage of a nucleic acid transcript of the target that contains modified nucleotides. The transcript is produced using template-dependent RNA or DNA polymerases that can incorporate the modified nucleotides. Specific primers (with promoter sequences for RNA polymerases) are usually employed.

In general, the methods implementing partial sequencing by fragmentation all rely on similar techniques as discussed herein. The general form of such an implementation is shown in FIG. 1. In order to simplify the mass spectra of the cleavage reaction products, any oligonucleotide primers (either random or specific) used in creation of the transcript should be removed or designed so that they are totally destroyed by the cleavage reaction. In addition, all fragments should have identical 5′ termini as well as identical 3′ termini (the 5′ termini may be different from the 3′ termini, however).

The structures and molecular weights of nucleotides and nucleotide analogs used in the following examples are shown in Table 17.

Example 1

The multicutter family _(16/15)[inv(α.α)]:4 can be implemented using modified nucleotides and chemical cleavage reactions described by Stanton Jr. et al. (2003, U.S. Pat. No. 6,610,492). The nucleotides used for each specific multicutter are shown in Table 18.

The modified nucleotides are incorporated during PCR amplification of the target and are chemically cleaved by KMnO₄ and 3-pyrrolidinol. This cleavage reaction totally destroys the modified nucleotides and produces fragments with both 5′ and 3′ phosphate groups (Wolfe et al. 2002). The strict fragment identity mappings for each multicutter in this family are shown in Tables 19A and 19B.

Since this multicutter family can generate only one possible fragment at any given length L, the cleavage reaction products may be analyzed using single-base-resolving electrophoresis. A sample target partially sequenced using the multicutter _(16/15)[inv(A.A)] is shown in FIG. 2. In this example, the PCR amplification generates a double-stranded product, one strand of which is removed prior to performing the cleavage reaction. The cleavage reaction also destroys the primer entirely.
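At the level of the multicutter notation, the outcome of such a cleavage can be simulated in silico. The Python sketch below is an illustration only: the function names are hypothetical, it models a multicutter purely as a set of dinucleotides that are cut between their two bases, and it ignores the chemistry of the reaction (for example, the destruction of modified nucleotides). Cutters are written with the abbreviations of Table 1, so that _(16/15)[inv(A.A)] is entered in its equivalent form "A.B B.N".

```python
# Nucleotide abbreviations as listed in Table 1.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "W": "AT", "S": "CG",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(cutters):
    """Expand cutters such as "A.B B.N" into a set of (5' base, 3' base) pairs."""
    pairs = set()
    for cutter in cutters.split():
        left, right = cutter.split(".")
        pairs.update((a, b) for a in IUPAC[left] for b in IUPAC[right])
    return pairs


def digest(seq, cutters):
    """Cut seq between every adjacent pair of bases matched by the cutters."""
    if not seq:
        return []
    pairs = expand(cutters)
    fragments, start = [], 0
    for i in range(1, len(seq)):
        if (seq[i - 1], seq[i]) in pairs:
            fragments.append(seq[start:i])
            start = i
    fragments.append(seq[start:])  # last piece ends at the end of the target
    return fragments


if __name__ == "__main__":
    # Toy target, not a real sequence.
    print(digest("GCAAAATCAAG", "A.B B.N"))
```

With this cutter, every dinucleotide except A.A is cut, so only runs of A survive as fragments longer than one base.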

Example 2

The specific multicutter _(4/3)[inv(A.)] or _(4/3)[B.] can be easily implemented by cleaving an RNA transcript of the target with a combination of RNase T1 (cleaves 3′ to rG) and RNase A (cleaves 3′ to rC and rU). The one dalton mass difference between rC and rU, which is particularly difficult to resolve, may be corrected by substituting 5Me-rCTP for rCTP or 5Me-rUTP for rUTP during the transcription reaction. The RNase cleavage reactions should be performed under conditions that minimize production of 2′,3′ cyclic phosphate groups in favor of 3′ phosphates (Hartmer et al. 2003; Krebs et al. 2003).
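The effect of the 5-methyl substitution on the rC/rU mass gap can be illustrated numerically. The sketch below is a rough illustration under stated assumptions: the residue masses are approximate average masses of unmodified ribonucleotide residues (nucleoside monophosphate minus water), the methyl shift is the approximate mass of CH₂, and the terminal-group mass is left at zero because it cancels in the difference. None of these numbers are taken from Table 17, and the function names are hypothetical.

```python
# Assumed approximate average residue masses (Da) inside an RNA chain.
RESIDUE_MASS = {"A": 329.21, "C": 305.18, "G": 345.21, "U": 306.17}
METHYL_SHIFT = 14.03  # assumed shift for a 5-methyl group (H replaced by CH3)


def fragment_mass(fragment, masses, m_term=0.0):
    """Sum residue masses for a fragment; m_term is the terminal-group mass."""
    return m_term + sum(masses[base] for base in fragment)


def substitute_5_methyl(masses, base):
    """Return a copy of the mass table with `base` replaced by its 5-methyl analog."""
    shifted = dict(masses)
    shifted[base] += METHYL_SHIFT
    return shifted


if __name__ == "__main__":
    methyl_c = substitute_5_methyl(RESIDUE_MASS, "C")
    for label, table in (("rC vs rU", RESIDUE_MASS), ("5Me-rC vs rU", methyl_c)):
        gap = fragment_mass("AAAC", table) - fragment_mass("AAAU", table)
        print(f"{label}: AAAC - AAAU = {gap:+.2f} Da")
```

Under these assumed masses, the two fragments of the form (A)_(n)C and (A)_(n)U differ by about one dalton before substitution and by roughly thirteen daltons after 5Me-rC replaces rC.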

Example 3

All the multicutters in the _(4/3)[inv(α.)]:4 family (also written _(4/3)[α. β. γ.]:4) can be implemented by producing a nucleic acid transcript of the target using appropriate nucleotide triphosphates, followed by complete cleavage with alkali or nonspecific RNases. Nucleotides used to implement each specific multicutter are shown in Table 20.

Cleavage with alkali will produce fragments with 5′-OH groups and 2′,3′ cyclic phosphate groups. These phosphate groups may be removed enzymatically, using alkaline phosphatase. The strict fragment identity mappings for each multicutter in this family are shown in Table 21.

A sample target partially sequenced using the multicutter _(4/3)[inv(A.)] (also written _(4/3)[B.]) is shown in FIG. 3. In this example, all terminal phosphate groups have been removed by alkaline phosphatase. The cleavage reaction destroys the primer entirely.

Multicutters Composed of Dinucleotide-Specific Cleavages

Example 4

The multicutter family _(16/9)[inv(α.η η.β)]:12 can be implemented using an enhancement of specific dinucleotide cleavage described by Stanton Jr. et al. (2003, U.S. Pat. No. 6,566,059). This method uses rNTPs and 5′-amino-2′,5′-dideoxyribonucleotides (nNTPs). As presented, dinucleotides composed of two of the same nucleotide cannot be cleaved. This deficit may be overcome by using nucleotides with one of the structures shown in FIG. 4.

We refer to the first structure as a nrNTP and the second as a SrNTP. In order to implement the multicutters in this family, exactly three of the nucleotides must have 2′-OH groups while a different group of three nucleotides must all have 5′ amino groups. The nucleotides used for each multicutter are shown in Table 22.

Following polymerase-mediated cleavage of all adjacent 2′-OH and phosphoramidate groups, all fragments retain 2′,3′ cyclic phosphate groups. The multicutter _(16/9)[B.V] produces a relaxed fragment identity mapping as shown in Table 23.

Masses that are not part of the fragment identity mapping are shown in bold while fragments that cannot be unambiguously detected are shown in italic. The multicutter _(16/9)[B.H] produces a limited fragment identity mapping as shown in Tables 24A and 24B.

The upper bound for this limited fragment identity mapping is 3425 Da. Above this mass the mapping is relaxed. A sample target partially sequenced using the multicutter _(16/9)[B.H] is shown in FIG. 5. The cleavage reaction destroys the primer entirely.

Example 5

Kless (2001, WO 01/16366) describes a modified template-directed polymerase that can accept dinucleotide triphosphates. In order for the dinucleotide triphosphate to be incorporated by the polymerase during synthesis, it must form two correct base pairs with the template. The multicutter family _(64/59)[K.N A.D.N C.B.N]:12 may be implemented using dinucleotide triphosphates with the structures shown in FIG. 6.

A transcript of the target is created using the nucleotides rCTP, rGTP, and rTTP, along with the dinucleotide triphosphates 5′ppp-dAdC, 5′ppp-rArA, 5′ppp-rArG, and 5′ppp-rArT. The transcript is then completely cleaved with alkali, producing fragments of the form

5′-(AC)_(k)A-3′

5′-(AC)_(k)C-3′

5′-(AC)_(k)G-3′

5′-(AC)_(k)T-3′,

where k=1, 2, 3, . . . .

This multicutter effectively extracts from the target one reading frame of all tandem repeats of the dinucleotide AC, along with the nucleotide 3′ to the end of the repeat.

Example 6

Simulation of Fingerprinting and Bacterial Identification by PSBF

Lefmann et al. (2004) describe a method for genotypic identification of bacteria using single-base cleavage of a ~500 bp region of the 16S ribosomal RNA gene (rDNA). The masses of the fragments detected by mass spectrometry, when compared to theoretical spectra calculated from reference sequences, provide enough information to accurately identify each of twelve mycobacterial strains. We have simulated the fingerprinting and identification of these twelve strains using the PSBF implementation of the multicutter family _(4/3)[inv(α.)]:4 described in Example 3. Table 25 below shows fragments of the forward strand of the 16S rDNA region from each of twelve mycobacterial strains generated by the multicutters in the _(4/3)[inv(α.)]:4 family. Fragments common to all strains are shown in lowercase, fragments useful for strain identification are shown in uppercase, and all sequences are written in the 5′ to 3′ direction. As shown in Table 21, each of the fragments in Table 25 has a unique and detectable molecular weight.

The multicutter _(4/3)[inv(T.)] provides the largest number of useful fragments, but it alone cannot be used to uniquely identify each strain. However, when combined with the multicutter _(4/3)[inv(G.)], each strain is unambiguously identifiable. In contrast to the method described by Lefmann et al., reference sequences are not required to interpret the fragment data. Fingerprinting with PSBF also provides useful sequence data. For example, of the twelve 16S rDNA sequences, only M. xenopi has the subsequences 5′-VTTTTTTG-3′ and 5′-HGGGGC-3′, only M. tuberculosis has the subsequence 5′-BAAAAG-3′, and only M. celatum has the subsequence 5′-VTTTTTG-3′. Only M. gordonae lacks the subsequence 5′-DCCCT-3′. It is possible that other mycobacterial strains may generate fingerprints identical to the ones shown in Table 25. In this case, PSBF may be used to analyze the reverse strand of the rDNA region, yielding a total of eight distinct fragment groups.
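The comparison of fingerprints described in this example can be sketched as simple set operations. The Python code below is a hedged illustration, not the simulation that produced Table 25: the function names are hypothetical, the two "strains" are short made-up sequences rather than 16S rDNA, and the minimum detectable length of four bases is an assumption. Each multicutter in the _(4/3)[inv(α.)]:4 family is modeled as a cut 3′ to every base except α.

```python
def digest_inv_mono(seq, kept):
    """Cut 3' to every base except `kept`, modeling 4/3[inv(kept.)]."""
    fragments, start = [], 0
    for i, base in enumerate(seq):
        if base != kept:
            fragments.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        fragments.append(seq[start:])  # trailing run that ends at the target end
    return fragments


def fingerprint(seq, kept_bases, min_len=4):
    """Set of (multicutter base, fragment) pairs at or above the assumed detectable length."""
    return {
        (kept, frag)
        for kept in kept_bases
        for frag in digest_inv_mono(seq, kept)
        if len(frag) >= min_len
    }


def distinguishing(fingerprints):
    """For each strain, the fingerprint entries found in no other strain."""
    unique = {}
    for name, fp in fingerprints.items():
        others = set().union(*(v for k, v in fingerprints.items() if k != name))
        unique[name] = fp - others
    return unique


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    strains = {"strain_1": "GTTTACTTTTG", "strain_2": "GTTTACTTTG"}
    prints = {name: fingerprint(seq, "TG") for name, seq in strains.items()}
    for name, frags in distinguishing(prints).items():
        print(name, sorted(frags))
```

Passing more than one base in `kept_bases` combines the fingerprints of several multicutters, which mirrors the combination of _(4/3)[inv(T.)] and _(4/3)[inv(G.)] used above to separate all twelve strains.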

REFERENCES

All the references cited herein and throughout the specification are herein incorporated by reference in their entirety.

1. U.S. Pat. No. 6,660,229 B2
2. WO 01/16366 (PCT/IL00/00515)
3. U.S. Pat. No. 6,566,059 B1
4. U.S. Pat. No. 6,582,923 B2
5. U.S. Pat. No. 6,610,492 B1
6. Zabeau, M. and Stanssens, P. (2000) Diagnostic Sequencing by a Combination of Specific Cleavage and Mass Spectrometry. International PCT Application WO 00/66771 (PCT/EP00/03904).
7. Bansal A., van den Boom D., Kammerer S., Honisch C., Adam G., Cantor C. R., Kleyn P., and Braun A. (2002). Association testing by DNA pooling: an effective initial screen. Proc Natl Acad Sci USA 99: 16871-4.
8. Bocker S. (2003). SNP and mutation discovery using base-specific cleavage and MALDI-TOF mass spectrometry. Bioinformatics 19 Suppl 1: 144-153.
9. Buetow K. H., Edmonson M., MacDonald R., Clifford R., Yip P., Kelley J., Little D. P., Strausberg R., Koester H., Cantor C. R., and Braun A. (2001). High-throughput development and characterization of a genomewide collection of gene-based single nucleotide polymorphism markers by chip-based matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Proc Natl Acad Sci USA 98: 581-4.
10. Ding C., and Cantor C. R. (2003). A high-throughput gene expression analysis technique using competitive PCR and matrix-assisted laser desorption ionization time-of-flight MS. Proc Natl Acad Sci USA 100: 3059-64.
11. Ding C., and Cantor C. R. (2003). Direct molecular haplotyping of long-range genomic DNA with M1-PCR. Proc Natl Acad Sci USA 100: 7449-53.
12. Ding C., and Cantor C. R. (2004). Quantitative analysis of nucleic acids--the last few years of progress. J Biochem Mol Biol 37: 1-10.
13. Elso C., Toohey B., Reid G. E., Poetter K., Simpson R. J., and Foote S. J. (2002). Mutation detection using mass spectrometric separation of tiny oligonucleotide fragments. Genome Res 12: 1428-33.
14. Fu D. J., Broude N. E., Koster H., Smith C. L., and Cantor C. R. (1996). Efficient preparation of short DNA sequence ladders potentially suitable for MALDI-TOF DNA sequencing. Genet Anal 12: 137-42.
15. Hartmer R., Storm N., Boecker S., Rodi C. P., Hillenkamp F., Jurinke C., and van den Boom D. (2003). RNase T1 mediated base-specific cleavage and MALDI-TOF MS for high-throughput comparative sequence analysis. Nucleic Acids Res 31: e47.
16. Jurinke C., van den Boom D., Cantor C. R., and Koster H. (2001). Automated genotyping using the DNA MassArray technology. Methods Mol Biol 170: 103-16.
17. Jurinke C., van den Boom D., Cantor C. R., and Koster H. (2002). Automated genotyping using the DNA MassArray technology. Methods Mol Biol 187: 179-92.
18. Jurinke C., van den Boom D., Cantor C. R., and Koster H. (2002). The use of MassARRAY technology for high throughput genotyping. Adv Biochem Eng Biotechnol 77: 57-74.
19. Jurinke C., van den Boom D., Jacob A., Tang K., Worl R., and Koster H. (1996). Analysis of ligase chain reaction products via matrix-assisted laser desorption/ionization time-of-flight-mass spectrometry. Anal Biochem 237: 174-81.
20. Koster H., Tang K., Fu D. J., Braun A., van den Boom D., Smith C. L., Cotter R. J., and Cantor C. R. (1996). A strategy for rapid and efficient DNA sequencing by mass spectrometry. Nat Biotechnol 14: 1123-8.
21. Lefmann M., Honisch C., Bocker S., Storm N., von Wintzingerode F., Schlotelburg C., Moter A., van den Boom D., and Gobel U. B. (2004). Novel mass spectrometry-based tool for genotypic identification of mycobacteria. J Clin Microbiol 42: 339-46.
22. Li Y., Tang K., Little D. P., Koster H., Hunter R. L., and McIver R. T., Jr. (1996). High-resolution MALDI Fourier transform mass spectrometry of oligonucleotides. Anal Chem 68: 2090-6.
23. Nordhoff E., Luebbert C., Thiele G., Heiser V., and Lehrach H. (2000). Rapid determination of short DNA sequences by the use of MALDI-MS. Nucleic Acids Res 28: E86.
24. Rodi C. P., Darnhofer-Patel B., Stanssens P., Zabeau M., and van den Boom D. (2002). A strategy for the rapid discovery of disease markers using the MassARRAY system. Biotechniques Suppl: 62-6, 68-9.
25. Shchepinov M. S., Denissenko M. F., Smylie K. J., Worl R. J., Leppin A. L., Cantor C. R., and Rodi C. P. (2001). Matrix-induced fragmentation of P3′-N5′ phosphoramidate-containing DNA: high-throughput MALDI-TOF analysis of genomic sequence polymorphisms. Nucleic Acids Res 29: 3864-72.
26. Siegert C. W., Jacob A., and Koster H. (1996). Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry for the detection of polymerase chain reaction products containing 7-deazapurine moieties. Anal Biochem 243: 55-65.
27. Smylie K. J., Cantor C. R., and Denissenko M. F. (2004). Analysis of sequence variations in several human genes using phosphoramidite bond DNA fragmentation and chip-based MALDI-TOF. Genome Res 14: 134-41.
28. Stanssens P., Zabeau M., Meersseman G., Remes G., Gansemans Y., Storm N., Hartmer R., Honisch C., Rodi C. P., Bocker S., and van den Boom D. (2004). High-throughput MALDI-TOF discovery of genomic sequence polymorphisms. Genome Res 14: 126-33.
29. von Wintzingerode F., Bocker S., Schlotelburg C., Chiu N. H., Storm N., Jurinke C., Cantor C. R., Gobel U. B., and van den Boom D. (2002). Base-specific fragmentation of amplified 16S rRNA genes analyzed by mass spectrometry: a tool for rapid bacterial identification. Proc Natl Acad Sci USA 99: 7039-44.
30. Wolfe J. L., Kawate T., Belenky A., and Stanton V., Jr. (2002). Synthesis and polymerase incorporation of 5′-amino-2′,5′-dideoxy-5′-N-triphosphate nucleotides. Nucleic Acids Res 30: 3739-47.
31. Wolfe J. L., Kawate T., Sarracino D. A., Zillmann M., Olson J., Stanton V. P., Jr., and Verdine G. L. (2002). A genotyping strategy based on incorporation and cleavage of chemically modified nucleotides. Proc Natl Acad Sci USA 99: 11073-8.
32. Wolfe J. L., Wang B. H., Kawate T., and Stanton V. P., Jr. (2003). Sequence-specific dinucleotide cleavage promoted by synergistic interactions between neighboring modified nucleotides in DNA. J Am Chem Soc 125: 10500-1.

TABLE 1

Abbreviation  Nucleotides Represented  Description
R             A/G                      purine
Y             C/T                      pyrimidine
M             A/C                      amino
K             G/T                      keto
W             A/T                      weak
S             C/G                      strong
B             C/G/T                    not A
D             A/G/T                    not C
H             A/C/T                    not G
V             A/C/G                    not T
N             A/C/G/T                  any

TABLE 2

Permutation  “Generic” Nucleotide
No.          α  β  γ  δ
1            A  C  G  T
2            A  C  T  G
3            A  G  C  T
4            A  G  T  C
5            A  T  C  G
6            A  T  G  C
7            C  A  G  T
8            C  A  T  G
9            C  G  A  T
10           C  G  T  A
11           C  T  A  G
12           C  T  G  A
13           G  A  C  T
14           G  A  T  C
15           G  C  A  T
16           G  C  T  A
17           G  T  A  C
18           G  T  C  A
19           T  A  C  G
20           T  A  G  C
21           T  C  A  G
22           T  C  G  A
23           T  G  A  C
24           T  G  C  A

TABLE 3A

4/1[A.] or 16/4[A.N]

Columns: (i) fragment length L (nt); (ii) no. of possible fragments of length L; (iii) no. of possible compositions at length L; (iv) average no. of fragments of length L expected per kilobase of target; (v) average distance between fragments of length L or greater (bases); (vi) fraction of target bases covered by fragments of length L (%); (vii) cumulative fraction of target covered by fragments of length L or greater (%); (viii) fraction of total fragments at length L (%).

(i)   (ii)      (iii)  (iv)   (v)     (vi)   (vii)  (viii)
1     1         1      62.5   0       6.25   100    25
2     3         3      46.9   0.33    9.38   93.75  18.76
3     9         6      35.2   1.11    10.55  84.37  14.06
4     27        10     26.4   2.48    10.55  73.82  10.54
5     81        15     19.8   4.64    9.89   63.27  7.91
6     243       21     14.8   7.86    8.89   53.38  5.93
7     729       28     11.1   12.48   7.79   44.49  4.45
8     2187      36     8.3    18.97   6.67   36.70  3.33
9     6561      45     6.3    27.95   5.64   30.03  2.51
10    19683     55     4.7    40.28   4.70   24.39  1.88
11    59049     66     3.5    57.08   3.87   19.70  1.41
12    177147    78     2.6    79.81   3.16   15.82  1.05
13    531441    91     2.0    110.31  2.59   12.66  0.80
14    1594323   105    1.5    151.70  2.08   10.08  0.59
15    4782969   120    1.1    207.16  1.65   8.00   0.44
16    14348907  136    0.8    281.11  1.33   6.33   0.33

TABLE 3B

Fragment Length: 5 bases

a.CCCCA.n a.CCCGA.n a.CCCTA.n a.CCGCA.n a.CCGGA.n a.CCGTA.n a.CCTCA.n a.CCTGA.n a.CCTTA.n
a.CGCCA.n a.CGCGA.n a.CGCTA.n a.CGGCA.n a.CGGGA.n a.CGGTA.n a.CGTCA.n a.CGTGA.n a.CGTTA.n
a.CTCCA.n a.CTCGA.n a.CTCTA.n a.CTGCA.n a.CTGGA.n a.CTGTA.n a.CTTCA.n a.CTTGA.n a.CTTTA.n
a.GCCCA.n a.GCCGA.n a.GCCTA.n a.GCGCA.n a.GCGGA.n a.GCGTA.n a.GCTCA.n a.GCTGA.n a.GCTTA.n
a.GGCCA.n a.GGCGA.n a.GGCTA.n a.GGGCA.n a.GGGGA.n a.GGGTA.n a.GGTCA.n a.GGTGA.n a.GGTTA.n
a.GTCCA.n a.GTCGA.n a.GTCTA.n a.GTGCA.n a.GTGGA.n a.GTGTA.n a.GTTCA.n a.GTTGA.n a.GTTTA.n
a.TCCCA.n a.TCCGA.n a.TCCTA.n a.TCGCA.n a.TCGGA.n a.TCGTA.n a.TCTCA.n a.TCTGA.n a.TCTTA.n
a.TGCCA.n a.TGCGA.n a.TGCTA.n a.TGGCA.n a.TGGGA.n a.TGGTA.n a.TGTCA.n a.TGTGA.n a.TGTTA.n
a.TTCCA.n a.TTCGA.n a.TTCTA.n a.TTGCA.n a.TTGGA.n a.TTGTA.n a.TTTCA.n a.TTTGA.n a.TTTTA.n

TABLE 4A

16/3[A.B]

Columns: (i) fragment length L (nt); (ii) no. of possible fragments of length L; (iii) no. of possible compositions at length L; (iv) average no. of fragments of length L expected per kilobase of target; (v) average distance between fragments of length L or greater (bases); (vi) fraction of target bases covered by fragments of length L (%); (vii) cumulative fraction of target covered by fragments of length L or greater (%); (viii) fraction of total fragments at length L (%).

(i)   (ii)      (iii)  (iv)   (v)     (vi)   (vii)  (viii)
1     0         0      0      0       0      100    0
2     3         3      35.2   0       7.03   100    18.75
3     12        9      35.2   0.46    10.55  92.97  18.75
4     39        19     28.6   1.50    11.43  82.42  16.24
5     120       34     22.0   3.27    10.99  70.99  11.72
6     363       55     16.6   6.00    9.98   60.00  8.87
7     1092      83     12.5   9.99    8.74   50.02  6.66
8     3279      119    9.4    15.65   7.51   41.28  5.00
9     9840      164    7.1    23.53   6.35   33.77  3.76
10    29523     219    5.3    34.41   5.28   27.42  2.81
11    88572     285    3.9    49.22   4.34   22.15  2.11
12    265719    363    3.0    69.25   3.57   17.80  1.58
13    797160    454    2.2    96.38   2.89   14.24  1.18
14    2391483   559    1.7    132.78  2.34   11.35  0.89
15    7174452   679    1.3    181.75  1.88   9.01   0.67
16    21523359  815    0.9    247.57  1.50   7.13   0.50

TABLE 4B

Fragment Length: 5 bases

a.CAAAA.b a.CCAAA.b a.CCCAA.b a.CCCCA.b a.CCCGA.b a.CCCTA.b a.CCGAA.b a.CCGCA.b
a.CCGGA.b a.CCGTA.b a.CCTAA.b a.CCTCA.b a.CCTGA.b a.CCTTA.b a.CGAAA.b a.CGCAA.b
a.CGCCA.b a.CGCGA.b a.CGCTA.b a.CGGAA.b a.CGGCA.b a.CGGGA.b a.CGGTA.b a.CGTAA.b
a.CGTCA.b a.CGTGA.b a.CGTTA.b a.CTAAA.b a.CTCAA.b a.CTCCA.b a.CTCGA.b a.CTCTA.b
a.CTGAA.b a.CTGCA.b a.CTGGA.b a.CTGTA.b a.CTTAA.b a.CTTCA.b a.CTTGA.b a.CTTTA.b
a.GAAAA.b a.GCAAA.b a.GCCAA.b a.GCCCA.b a.GCCGA.b a.GCCTA.b a.GCGAA.b a.GCGCA.b
a.GCGGA.b a.GCGTA.b a.GCTAA.b a.GCTCA.b a.GCTGA.b a.GCTTA.b a.GGAAA.b a.GGCAA.b
a.GGCCA.b a.GGCGA.b a.GGCTA.b a.GGGAA.b a.GGGCA.b a.GGGGA.b a.GGGTA.b a.GGTAA.b
a.GGTCA.b a.GGTGA.b a.GGTTA.b a.GTAAA.b a.GTCAA.b a.GTCCA.b a.GTCGA.b a.GTCTA.b
a.GTGAA.b a.GTGCA.b a.GTGGA.b a.GTGTA.b a.GTTAA.b a.GTTCA.b a.GTTGA.b a.GTTTA.b
a.TAAAA.b a.TCAAA.b a.TCCAA.b a.TCCCA.b a.TCCGA.b a.TCCTA.b a.TCGAA.b a.TCGCA.b
a.TCGGA.b a.TCGTA.b a.TCTAA.b a.TCTCA.b a.TCTGA.b a.TCTTA.b a.TGAAA.b a.TGCAA.b
a.TGCCA.b a.TGCGA.b a.TGCTA.b a.TGGAA.b a.TGGCA.b a.TGGGA.b a.TGGTA.b a.TGTAA.b
a.TGTCA.b a.TGTGA.b a.TGTTA.b a.TTAAA.b a.TTCAA.b a.TTCCA.b a.TTCGA.b a.TTCTA.b
a.TTGAA.b a.TTGCA.b a.TTGGA.b a.TTGTA.b a.TTTAA.b a.TTTCA.b a.TTTGA.b a.TTTTA.b

TABLE 5A

16/1[A.C]

Columns: (i) fragment length L (nt); (ii) no. of possible fragments of length L; (iii) no. of possible compositions at length L; (iv) average no. of fragments of length L expected per kilobase of target; (v) average distance between fragments of length L or greater (bases); (vi) fraction of target bases covered by fragments of length L (%); (vii) cumulative fraction of target covered by fragments of length L or greater (%); (viii) fraction of total fragments at length L (%).

(i)   (ii)           (iii)  (iv)   (v)     (vi)   (vii)  (viii)
1     0              0      0      0       0      100    0
2     1              1      3.90   0       0.78   100    6.25
3     4              4      3.91   0.13    1.17   99.22  6.26
4     15             10     3.67   0.36    1.47   98.05  5.87
5     56             20     3.41   0.67    1.71   96.58  5.46
6     209            35     3.18   1.08    1.91   94.87  5.10
7     780            56     2.97   1.58    2.08   92.96  4.75
8     2911           84     2.78   2.20    2.22   90.89  4.45
9     10864          120    2.59   2.93    2.33   88.66  4.15
10    40545          165    2.41   3.79    2.41   86.33  3.86
11    151316         220    2.26   4.78    2.48   83.92  3.61
12    564719         286    2.11   5.91    2.53   81.44  3.37
13    2107560        364    1.96   7.20    2.55   78.91  3.14
14    7865521        455    1.83   8.65    2.56   76.36  2.93
15    29354524       560    1.71   10.28   2.56   73.80  2.73
16    109552575      680    1.60   12.09   2.55   71.23  2.55
17    408855776      816    1.49   14.11   2.53   68.68  2.38
18    1525870529     969    1.39   16.34   2.51   66.15  2.23
19    5694626340     1140   1.30   18.82   2.46   63.65  2.07
20    21252634831    1330   1.21   21.54   2.41   61.18  1.93
21    79315912984    1540   1.12   24.52   2.36   58.77  1.80
22    296011017105   1771   1.05   27.77   2.31   56.41  1.68
23    1104728155436  2024   0.98   31.34   2.25   54.11  1.57
24    4122901604639  2300   0.92   35.23   2.20   51.86  1.47

TABLE 5B

Fragment Length: 5 bases

a.CAAAA.c a.CAAGA.c a.CAATA.c a.CAGAA.c a.CAGCA.c a.CAGGA.c a.CAGTA.c a.CATAA.c
a.CATCA.c a.CATGA.c a.CATTA.c a.CCAAA.c a.CCAGA.c a.CCATA.c a.CCCAA.c a.CCCCA.c
a.CCCGA.c a.CCCTA.c a.CCGAA.c a.CCGCA.c a.CCGGA.c a.CCGTA.c a.CCTAA.c a.CCTCA.c
a.CCTGA.c a.CCTTA.c a.CGAAA.c a.CGAGA.c a.CGATA.c a.CGCAA.c a.CGCCA.c a.CGCGA.c
a.CGCTA.c a.CGGAA.c a.CGGCA.c a.CGGGA.c a.CGGTA.c a.CGTAA.c a.CGTCA.c a.CGTGA.c
a.CGTTA.c a.CTAAA.c a.CTAGA.c a.CTATA.c a.CTCAA.c a.CTCCA.c a.CTCGA.c a.CTCTA.c
a.CTGAA.c a.CTGCA.c a.CTGGA.c a.CTGTA.c a.CTTAA.c a.CTTCA.c a.CTTGA.c a.CTTTA.c

TABLE 6

Type of Fragment   Condition (I): Every possible fragment   Condition (II): Every possible base
Identity Mapping   has a unique base composition            composition has a unique molecular weight
Strict             True for all fragments of all lengths    True for all masses to infinity
Relaxed            True only at certain fragment lengths    True for all masses to infinity
Limited            True for all fragments of all lengths    True for a finite range of masses

TABLE 7A

16/15[inv(A.A)] or 16/15[A.B B.N]

Columns: (i) fragment length L (nt); (ii) no. of possible fragments of length L; (iii) no. of possible compositions at length L; (iv) average no. of fragments of length L expected per kilobase of target; (v) average distance between fragments of length L or greater (bases); (vi) fraction of target bases covered by fragments of length L (%); (vii) cumulative fraction of target covered by fragments of length L or greater (%); (viii) fraction of total fragments at length L (%).

(i)   (ii)  (iii)  (iv)     (v)     (vi)     (vii)    (viii)
1     4     4      890.7    0       89.07    100      95
2     1     1      35.14    19      7.03     10.94    3.75
3     1     1      8.78     82      2.63     3.91     0.937
4     1     1      2.20     336     0.881    1.27     0.235
5     1     1      0.55     1356    0.276    0.392    0.0590
6     1     1      0.138    6490    0.083    0.115    0.0147
7     1     1      0.033    22596   0.023    0.032    0.0036
8     1     1      0.0080   92383   0.0064   0.0091   0.00085
9     1     1      0.00203  354390  0.0018   0.0027   0.00022
10    1     1      0.00053  995260  0.00053  0.00087  0.000056

TABLE 7B

Fragment Length (bases)
4         5          6           7            8
b.AAAA.b  b.AAAAA.b  b.AAAAAA.b  b.AAAAAAA.b  b.AAAAAAAA.b

TABLE 8A

4/3[B.] or 16/12[B.N]

Columns: (i) fragment length L (nt); (ii) no. of possible fragments of length L; (iii) no. of possible compositions at length L; (iv) average no. of fragments of length L expected per kilobase of target; (v) average distance between fragments of length L or greater (bases); (vi) fraction of target bases covered by fragments of length L (%); (vii) cumulative fraction of target covered by fragments of length L or greater (%); (viii) fraction of total fragments at length L (%).

(i)   (ii)  (iii)  (iv)    (v)     (vi)    (vii)   (viii)
1     3     3      562.7   0       56.27   100     75
2     3     3      140.6   3       28.11   43.73   18.75
3     3     3      35.15   18      10.54   15.62   4.69
4     3     3      8.782   81      3.51    5.08    1.17
5     3     3      2.196   336     1.10    1.56    0.293
6     3     3      0.552   1355    0.331   0.4649  0.074
7     3     3      0.1364  5472    0.095   0.1339  0.018
8     3     3      0.0351  21684   0.028   0.0384  0.005
9     3     3      0.0082  90182   0.0074  0.0103  0.0011
10    3     3      0.0020  359530  0.0020  0.0029  0.0003

TABLE 8B

Fragment Length (bases)
4         5          6           7            8
b.AAAC.n  b.AAAAC.n  b.AAAAAC.n  b.AAAAAAC.n  b.AAAAAAAC.n
b.AAAG.n  b.AAAAG.n  b.AAAAAG.n  b.AAAAAAG.n  b.AAAAAAAG.n
b.AAAT.n  b.AAAAT.n  b.AAAAAT.n  b.AAAAAAT.n  b.AAAAAAAT.n

TABLE 9A 16/9[C.M V.K T.T]
Fragment length L (nt) | No. of possible fragments of length L | No. of possible compositions at length L | Average no. of fragments of length L expected per kilobase of target | Average distance between fragments of length L or greater (bases) | Fraction of target bases covered by fragments of length L (%) | Cumulative fraction of target covered by fragments of length L or greater (%) | Fraction of total fragments at length L (%)
1 | 4 | 4 | 250.0 | 0 | 25.00 | 100 | 44.44
2 | 7 | 7 | 218.7 | 0.800 | 43.74 | 75.00 | 38.89
3 | 8 | 8 | 70.33 | 7.329 | 21.10 | 31.26 | 12.50
4 | 8 | 8 | 17.59 | 38.30 | 7.036 | 10.17 | 3.13
5 | 8 | 8 | 4.404 | 165.1 | 2.202 | 3.129 | 0.783
6 | 8 | 8 | 1.096 | 676.9 | 0.657 | 0.927 | 0.195
7 | 8 | 8 | 0.275 | 2711 | 0.192 | 0.270 | 0.049
8 | 8 | 8 | 0.0699 | 10742 | 0.056 | 0.077 | 0.012
9 | 8 | 8 | 0.0173 | 43389 | 0.016 | 0.0215 | 0.0031
10 | 8 | 8 | 0.00430 | 178378 | 0.0043 | 0.0060 | 0.0008

TABLE 9B
Fragment Length (bases): 4 | 5 | 6 | 7 | 8
c.AAAA.k | c.AAAAA.k | c.AAAAAA.k | c.AAAAAAA.k | c.AAAAAAAA.k
c.AAAC.n | c.AAAAC.n | c.AAAAAC.n | c.AAAAAAC.n | c.AAAAAAAC.n
v.GAAA.k | v.GAAAA.k | v.GAAAAA.k | v.GAAAAAA.k | v.GAAAAAAA.k
v.GAAC.n | v.GAAAC.n | v.GAAAAC.n | v.GAAAAAC.n | v.GAAAAAAC.n
n.TAAA.k | n.TAAAA.k | n.TAAAAA.k | n.TAAAAAA.k | n.TAAAAAAA.k
n.TAAC.n | n.TAAAC.n | n.TAAAAC.n | n.TAAAAAC.n | n.TAAAAAAC.n
n.TGAA.k | n.TGAAA.k | n.TGAAAA.k | n.TGAAAAA.k | n.TGAAAAAA.k
n.TGAC.n | n.TGAAC.n | n.TGAAAC.n | n.TGAAAAC.n | n.TGAAAAAC.n

TABLE 10A 16/14[inv(A.C C.A)] or 16/14[A.D C.B K.N]
Fragment length L (nt) | No. of possible fragments of length L | No. of possible compositions at length L | Average no. of fragments of length L expected per kilobase of target | Average distance between fragments of length L or greater (bases) | Fraction of target bases covered by fragments of length L (%) | Cumulative fraction of target covered by fragments of length L or greater (%) | Fraction of total fragments at length L (%)
1 | 4 | 4 | 781.47 | 0 | 78.15 | 100 | 89.30
2 | 2 | 1 | 70.21 | 8.34 | 14.04 | 21.85 | 8.02
3 | 2 | 2 | 17.58 | 39.3 | 5.27 | 7.81 | 2.01
4 | 2 | 1 | 4.388 | 166 | 1.76 | 2.54 | 0.501
5 | 2 | 2 | 1.102 | 677 | 0.551 | 0.781 | 0.126
6 | 2 | 1 | 0.273 | 2745 | 0.164 | 0.230 | 0.031
7 | 2 | 2 | 0.070 | 11005 | 0.049 | 0.066 | 0.0080
8 | 2 | 1 | 0.015 | 47656 | 0.012 | 0.0176 | 0.0017
9 | 2 | 2 | 0.0039 | 169504 | 0.0036 | 0.00537 | 0.00045
10 | 2 | 1 | 0.00140 | 546880 | 0.0014 | 0.00182 | 0.00016
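The cutting rule behind Table 10A can be made concrete with a short digestion routine. This is a minimal sketch, assuming that 16/14[inv(A.C C.A)] cleaves between every pair of adjacent bases except A.C and C.A; the function name and the toy sequence are hypothetical and chosen only for illustration.

```python
from typing import List, Set


def digest(sequence: str, uncut_dinucleotides: Set[str]) -> List[str]:
    """Split a sequence at every dinucleotide that is NOT in the uncut set."""
    fragments: List[str] = []
    start = 0
    for i in range(1, len(sequence)):
        # The bond between positions i-1 and i is cut unless the
        # dinucleotide spanning it is one of the protected pairs.
        if sequence[i - 1:i + 1] not in uncut_dinucleotides:
            fragments.append(sequence[start:i])
            start = i
    if sequence:
        fragments.append(sequence[start:])
    return fragments


if __name__ == "__main__":
    # Hypothetical toy target, not taken from the source.
    print(digest("GACACTCAG", {"AC", "CA"}))
```

For the toy target the only fragments longer than one base are alternating A/C runs, which matches the fragment patterns of the form ACAC and CACA expected for this multicutter.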

TABLE 10B
Fragment Length (bases): 4 | 5 | 6 | 7 | 8
d.ACAC.b | d.ACACA.d | d.ACACAC.b | d.ACACACA.d | d.ACACACAC.b
b.CACA.d | b.CACAC.b | b.CACACA.d | b.CACACAC.b | b.CACACACA.d

TABLE 11A 16/13[inv(A.C C.G G.A)] or 16/13[A.D C.H G.B T.N]
Fragment length L (nt) | No. of possible fragments of length L | No. of possible compositions at length L | Average no. of fragments of length L expected per kilobase of target | Average distance between fragments of length L or greater (bases) | Fraction of target bases covered by fragments of length L (%) | Cumulative fraction of target covered by fragments of length L or greater (%) | Fraction of total fragments at length L (%)
1 | 4 | 4 | 671.89 | 0 | 67.19 | 100 | 82.69
2 | 3 | 3 | 105.45 | 4.78 | 21.09 | 32.81 | 12.98
3 | 3 | 1 | 26.39 | 25.10 | 7.92 | 11.72 | 3.25
4 | 3 | 3 | 6.686 | 109.6 | 2.63 | 3.80 | 0.811
5 | 3 | 3 | 1.644 | 450.5 | 0.82 | 1.17 | 0.202
6 | 3 | 1 | 0.413 | 1813 | 0.26 | 0.348 | 0.051
7 | 3 | 3 | 0.102 | 7299 | 0.072 | 0.100 | 0.013
8 | 3 | 3 | 0.026 | 28970 | 0.021 | 0.0287 | 0.0032
9 | 3 | 1 | 0.0067 | 113866 | 0.0060 | 0.0081 | 0.00082
10 | 3 | 3 | 0.0016 | 468803 | 0.0016 | 0.0021 | 0.00020

TABLE 11B
Fragment Length (bases): 4 | 5 | 6 | 7 | 8
h.ACGA.d | h.ACGAC.h | h.ACGACG.b | h.ACGACGA.d | h.ACGACGAC.h
b.CGAC.h | b.CGACG.b | b.CGACGA.d | b.CGACGAC.h | b.CGACGACG.b
d.GACG.b | d.GACGA.d | d.GACGAC.h | d.GACGACG.b | d.GACGACGA.d

TABLE 12A 16/12[inv(A.C C.G G.T T.A)] or 16/12[A.D C.H G.V T.B]
Fragment length L (nt) | No. of possible fragments of length L | No. of possible compositions at length L | Average no. of fragments of length L expected per kilobase of target | Average distance between fragments of length L or greater (bases) | Fraction of target bases covered by fragments of length L (%) | Cumulative fraction of target covered by fragments of length L or greater (%) | Fraction of total fragments at length L (%)
1 | 4 | 4 | 562.54 | 0 | 56.25 | 100 | 75
2 | 4 | 4 | 140.65 | 3 | 28.13 | 43.75 | 18.75
3 | 4 | 4 | 35.15 | 18 | 10.55 | 15.62 | 4.69
4 | 4 | 1 | 8.77 | 81 | 3.51 | 5.07 | 1.17
5 | 4 | 4 | 2.19 | 336 | 1.10 | 1.56 | 0.292
6 | 4 | 4 | 0.553 | 1355 | 0.332 | 0.465 | 0.074
7 | 4 | 4 | 0.136 | 5493 | 0.095 | 0.133 | 0.018
8 | 4 | 1 | 0.0336 | 21994 | 0.027 | 0.038 | 0.0045
9 | 4 | 4 | 0.00867 | 84028 | 0.0078 | 0.011 | 0.0012
10 | 4 | 4 | 0.00238 | 303764 | 0.0024 | 0.0033 | 0.00032

TABLE 12B
Fragment Length (bases): 4 | 5 | 6 | 7 | 8
v.ACGT.b | v.ACGTA.d | v.ACGTAC.h | v.ACGTACG.v | v.ACGTACGT.b
b.CGTA.d | b.CGTAC.h | b.CGTACG.v | b.CGTACGT.b | b.CGTACGTA.d
d.GTAC.h | d.GTACG.v | d.GTACGT.b | d.GTACGTA.d | d.GTACGTAC.h
h.TACG.v | h.TACGT.b | h.TACGTA.d | h.TACGTAC.h | h.TACGTACG.v

TABLE 13A 16/11[inv(A.T K.M)] or 16/11[M.V B.K]
Fragment length L (nt) | No. of possible fragments of length L | No. of possible compositions at length L | Average no. of fragments of length L expected per kilobase of target | Average distance between fragments of length L or greater (bases) | Fraction of target bases covered by fragments of length L (%) | Cumulative fraction of target covered by fragments of length L or greater (%) | Fraction of total fragments at length L (%)
1 | 4 | 4 | 437.47 | 0 | 43.75 | 100 | 63.63
2 | 5 | 4 | 207.06 | 1.75 | 41.41 | 56.25 | 30.12
3 | 4 | 4 | 27.34 | 19.82 | 8.20 | 14.84 | 3.98
4 | 5 | 4 | 12.93 | 59.78 | 5.17 | 6.64 | 1.88
5 | 4 | 4 | 1.709 | 366.7 | 0.854 | 1.47 | 0.249
6 | 5 | 4 | 0.809 | 1016 | 0.485 | 0.612 | 0.118
7 | 4 | 4 | 0.106 | 5912 | 0.074 | 0.126 | 0.015
8 | 5 | 4 | 0.052 | 15915 | 0.042 | 0.052 | 0.0076
9 | 4 | 4 | 0.0068 | 92727 | 0.0061 | 0.010 | 0.0010
10 | 5 | 4 | 0.0033 | 238619 | 0.0033 | 0.0042 | 0.00047

TABLE 13B
Fragment Length (bases): 4 | 5 | 6 | 7 | 8
m.ATAT.k | m.ATATA.v | m.ATATAT.k | m.ATATATA.v | m.ATATATAT.k
n.GATA.v | m.ATATC.n | n.GATATA.v | m.ATATATC.n | n.GATATATA.v
n.GATC.n | n.GATAT.k | n.GATATC.n | n.GATATAT.k | n.GATATATC.n
b.TATA.v | b.TATAT.k | b.TATATA.v | b.TATATAT.k | b.TATATATA.v
b.TATC.n | | b.TATATC.n | | b.TATATATC.n

TABLE 14A 16/13[inv(A.A A.C C.C)] or 16/13[C.A M.K K.N]
Fragment length L (nt) | No. of possible fragments of length L | No. of possible compositions at length L | Average no. of fragments of length L expected per kilobase of target | Average distance between fragments of length L or greater (bases) | Fraction of target bases covered by fragments of length L (%) | Cumulative fraction of target covered by fragments of length L or greater (%) | Fraction of total fragments at length L (%)
1 | 4 | 4 | 687.58 | 0 | 68.76 | 100 | 84.62
2 | 3 | 3 | 82.03 | 5.50 | 16.41 | 31.24 | 10.10
3 | 4 | 4 | 29.28 | 19.83 | 8.78 | 14.84 | 3.60
4 | 5 | 5 | 9.52 | 68.75 | 3.81 | 6.05 | 1.17
5 | 6 | 6 | 2.93 | 235.7 | 1.47 | 2.24 | 0.361
6 | 7 | 7 | 0.867 | 816.2 | 0.520 | 0.778 | 0.107
7 | 8 | 8 | 0.248 | 2866 | 0.174 | 0.258 | 0.031
8 | 9 | 9 | 0.072 | 10013 | 0.058 | 0.084 | 0.0089
9 | 10 | 10 | 0.020 | 36258 | 0.018 | 0.026 | 0.0024
10 | 11 | 11 | 0.0055 | 126915 | 0.0055 | 0.0082 | 0.00067
11 | 12 | 12 | 0.0018 | 412405 | 0.0020 | 0.0028 | 0.00022

TABLE 14B
Fragment Length (bases): 4 | 5 | 6 | 7 | 8
b.AAAA.k | b.AAAAA.k | b.AAAAAA.k | b.AAAAAAA.k | b.AAAAAAAA.k
b.AAAC.d | b.AAAAC.d | b.AAAAAC.d | b.AAAAAAC.d | b.AAAAAAAC.d
b.AACC.d | b.AAACC.d | b.AAAACC.d | b.AAAAACC.d | b.AAAAAACC.d
b.ACCC.d | b.AACCC.d | b.AAACCC.d | b.AAAACCC.d | b.AAAAACCC.d
k.CCCC.d | b.ACCCC.d | b.AACCCC.d | b.AAACCCC.d | b.AAAACCCC.d
 | k.CCCCC.d | b.ACCCCC.d | b.AACCCCC.d | b.AAACCCCC.d
 | | k.CCCCCC.d | b.ACCCCCC.d | b.AACCCCCC.d
 | | | k.CCCCCCC.d | b.ACCCCCCC.d
 | | | | k.CCCCCCCC.d

TABLE 15A 16/9[B.V]
Fragment length L (nt) | No. of possible fragments of length L | No. of possible compositions at length L | Average no. of fragments of length L expected per kilobase of target | Average distance between fragments of length L or greater (bases) | Fraction of target bases covered by fragments of length L (%) | Cumulative fraction of target covered by fragments of length L or greater (%) | Fraction of total fragments at length L (%)
1 | 2 | 2 | 281.27 | 0 | 28.13 | 100 | 50
2 | 5 | 5 | 175.81 | 1 | 35.16 | 71.87 | 31.25
3 | 8 | 8 | 70.30 | 6 | 21.09 | 36.71 | 12.50
4 | 11 | 11 | 24.18 | 24 | 9.67 | 15.62 | 4.30
5 | 14 | 14 | 7.69 | 85.66 | 3.85 | 5.95 | 1.37
6 | 17 | 17 | 2.33 | 297.75 | 1.40 | 2.11 | 0.414
7 | 20 | 20 | 0.686 | 1033.1 | 0.480 | 0.711 | 0.122
8 | 23 | 23 | 0.197 | 3630.4 | 0.158 | 0.231 | 0.035
9 | 26 | 26 | 0.056 | 12829 | 0.050 | 0.073 | 0.010
10 | 29 | 29 | 0.016 | 46024 | 0.016 | 0.023 | 0.0028
11 | 32 | 32 | 0.0045 | 162770 | 0.0050 | 0.0070 | 0.0008
12 | 35 | 35 | 0.0013 | 635812 | 0.0015 | 0.0021 | 0.00023

TABLE 15B
Fragment Length (bases): 4 | 5 | 6 | 7 | 8
b.AAAC.v | b.AAAAC.v | b.AAAAAC.v | b.AAAAAAC.v | b.AAAAAAAC.v
b.AAAG.v | b.AAAAG.v | b.AAAAAG.v | b.AAAAAAG.v | b.AAAAAAAG.v
b.AAAT.v | b.AAAAT.v | b.AAAAAT.v | b.AAAAAAT.v | b.AAAAAAAT.v
b.AACT.v | b.AAACT.v | b.AAAACT.v | b.AAAAACT.v | b.AAAAAACT.v
b.AAGT.v | b.AAAGT.v | b.AAAAGT.v | b.AAAAAGT.v | b.AAAAAAGT.v
b.AATT.v | b.AAATT.v | b.AAAATT.v | b.AAAAATT.v | b.AAAAAATT.v
b.ACTT.v | b.AACTT.v | b.AAACTT.v | b.AAAACTT.v | b.AAAAACTT.v
b.AGTT.v | b.AAGTT.v | b.AAAGTT.v | b.AAAAGTT.v | b.AAAAAGTT.v
b.ATTT.v | b.AATTT.v | b.AAATTT.v | b.AAAATTT.v | b.AAAAATTT.v
b.CTTT.v | b.ACTTT.v | b.AACTTT.v | b.AAACTTT.v | b.AAAACTTT.v
b.GTTT.v | b.AGTTT.v | b.AAGTTT.v | b.AAAGTTT.v | b.AAAAGTTT.v
 | b.ATTTT.v | b.AATTTT.v | b.AAATTTT.v | b.AAAATTTT.v
 | b.CTTTT.v | b.ACTTTT.v | b.AACTTTT.v | b.AAACTTTT.v
 | b.GTTTT.v | b.AGTTTT.v | b.AAGTTTT.v | b.AAAGTTTT.v
 | | b.ATTTTT.v | b.AATTTTT.v | b.AAATTTTT.v
 | | b.CTTTTT.v | b.ACTTTTT.v | b.AACTTTTT.v
 | | b.GTTTTT.v | b.AGTTTTT.v | b.AAGTTTTT.v
 | | | b.ATTTTTT.v | b.AATTTTTT.v
 | | | b.CTTTTTT.v | b.ACTTTTTT.v
 | | | b.GTTTTTT.v | b.AGTTTTTT.v
 | | | | b.ATTTTTTT.v
 | | | | b.CTTTTTTT.v
 | | | | b.GTTTTTTT.v

TABLE 16A 16/6[C.A G.M T.V]
Fragment length L (nt) | No. of possible fragments of length L | No. of possible compositions at length L | Average no. of fragments of length L expected per kilobase of target | Average distance between fragments of length L or greater (bases) | Fraction of target bases covered by fragments of length L (%) | Cumulative fraction of target covered by fragments of length L or greater (%) | Fraction of total fragments at length L (%)
1 | 2 | 2 | 62.47 | 0 | 6.25 | 100 | 16.66
2 | 8 | 8 | 136.68 | 0.20 | 27.34 | 93.75 | 36.45
3 | 18 | 18 | 93.79 | 1.91 | 28.14 | 66.42 | 25.01
4 | 33 | 33 | 47.84 | 7.52 | 19.13 | 38.28 | 12.76
5 | 54 | 54 | 20.99 | 23.65 | 10.49 | 19.15 | 5.60
6 | 82 | 82 | 8.38 | 69.21 | 5.03 | 8.65 | 2.24
7 | 118 | 118 | 3.14 | 200.21 | 2.19 | 3.62 | 0.836
8 | 163 | 163 | 1.11 | 587.16 | 0.892 | 1.43 | 0.297
9 | 218 | 218 | 0.378 | 1763.0 | 0.340 | 0.535 | 0.101
10 | 284 | 284 | 0.126 | 5367.8 | 0.126 | 0.194 | 0.034
11 | 362 | 362 | 0.041 | 16790 | 0.045 | 0.068 | 0.011
12 | 453 | 453 | 0.013 | 53805 | 0.015 | 0.023 | 0.0034

TABLE 16B Fragment Length (bases) 4 5 6 b.AAAC.a b.AAAG.m b.AAAAC.ab.AAAAG.m b.AAAAT.v b.AAAAAC.a b.AAAAAG.m b.AAAAAT.v b.AAAACC.a b.AAAT.vb.AACC.a b.AAACC.a b.AAACG.m b.AAACT.v b.AAAACG.m b.AAAACT.v b.AAAAGG.mb.AAAAGT.v b.AACG.m b.AACT.v b.AAAGG.m b.AAAGT.v b.AAATT.v b.AAAATT.vb.AAAGCC.a b.AAACCG.m b.AAACCT.v b.AAGG.m b.AAGT.v b.AACCC.a b.AACCG.mb.AACCT.v b.AAACGG.m b.AAACGT.v b.AAACTT.v b.AAAGGG.m b.AATT.v b.ACCC.ab.AACGG.m b.AACGT.v b.AACTT.v b.AAAGGT.v b.AAAGTT.v b.AAATTT.vb.AACCCC.a b.ACCG.m b.ACCT.v b.AAGGG.m b.AAGGT.v b.AAGTT.v b.AACCCG.mb.AACCCT.v b.AACCGG.m b.AACCGT.v b.ACGG.m b.ACGT.v b.AATTT.v b.ACCCC.ab.ACCCG.m b.AACCTT.v b.AACGGG.m b.AACGGT.v b.AACGTT.v b.ACTT.v b.AGGG.mb.ACCCT.v b.ACCGG.m b.ACCGT.v b.AACTTT.v b.AAGGGG.m b.AACGGT.vb.AAGGTT.v b.AGGT.v b.AGTT.v b.ACCTT.v b.ACGGG.m b.ACGGT.v b.AAGTTT.vb.AATTTT.v b.ACCCCC.a b.ACCCCG.m b.ATTT.v b.CCCC.a b.ACGTT.v b.ACTTT.vb.AGGGG.m b.ACCCCT.v b.ACCCGG.m b.ACCCGT.v b.ACCCTT.v k.CCCG.m k.CCCT.vb.AGGGT.v b.AGGTT.v b.AGTTT.v b.ACCGGG.m b.ACCGGT.v b.ACCGTT.vb.ACCTTT.v k.CCGG.m k.CCGT.v b.ATTTT.v k.CCCCC.a k.CCCCG.m b.ACGGGG.mb.ACGGGT.v b.ACGGTT.v b.ACGTTT.v k.CCTT.v k.CGGG.m k.CCCCT.v k.CCCGG.mk.CCCGT.v b.ACTTTT.v b.AGGGGG.m b.AGGGGT.v b.AGGGTT.v k.CGGT.v k.CGTT.vk.CCCTT.v k.CCGGG.m k.CCGGT.v b.AGGTTT.v b.AGTTTT.v b.ATTTTT.vk.CCCCCC.a k.CTTT.v t.GGGG.m k.CCGTT.v k.CCTTT.v k.CGGGG.m k.CCCCCG.mk.CCCCCT.v k.CCCCGG.m k.CCCCGT.v t.GGGT.v t.GGTT.v k.CGGGT.v k.CGGTT.vk.CGTTT.v k.CCCCTT.v k.CGCGGG.m k.CCCGGT.v k.CCCGTT.v t.GTTT.v k.CTTTT.vt.GGGGG.m t.GGGGT.v k.CCCTTT.v k.CCGGGG.m k.CCGGGT.v k.CCGGTT.vt.GGGTT.v t.GGTTT.v t.GTTTT.v k.CCGTTT.v k.CCTTTT.v k.CGGGGG.mk.CGGGGT.v k.CGGGTT.v k.CGGTTT.v k.CGTTTT.v k.CTTTTT.v t.GGGGGG.mt.GGGGGT.v t.GGGGTT.v t.GGGTTT.v t.GGTTTT.v t.GTTTTT.v

TABLE 17
Nucleotide | Mass (Da) | Description
rA | 329.2 | adenosine monophosphate
rC | 305.2 | cytidine monophosphate
rG | 345.2 | guanosine monophosphate
rT | 320.2 | thymidine monophosphate
dA | 313.2 | 2′-deoxyadenosine monophosphate
dC | 289.2 | 2′-deoxycytidine monophosphate
dG | 329.2 | 2′-deoxyguanosine monophosphate
dT | 304.2 | 2′-deoxythymidine monophosphate
nA | 312.2 | 5′-amino-2′,5′-dideoxyadenosine monophosphate
nC | 288.2 | 5′-amino-2′,5′-dideoxycytidine monophosphate
nG | 328.2 | 5′-amino-2′,5′-dideoxyguanosine monophosphate
nT | 303.2 | 5′-amino-2′,5′-dideoxythymidine monophosphate
nrA | 328.2 | 5′-amino-5′-deoxyadenosine monophosphate
nrC | 304.2 | 5′-amino-5′-deoxycytidine monophosphate
nrG | 344.2 | 5′-amino-5′-deoxyguanosine monophosphate
nrT | 319.2 | 5′-amino-5′-deoxythymidine monophosphate
[Structure column: chemical structure drawings not reproduced.]

TABLE 18 Multicutter Family _(16/15)[inv(α.α)]
Nucleotide | _(16/15)[inv(A.A)] | _(16/15)[inv(C.C)] | _(16/15)[inv(G.G)] | _(16/15)[inv(T.T)]
A | dATP | 7-deaza-7-nitro-dATP | 7-deaza-7-nitro-dATP | 7-deaza-7-nitro-dATP
C | 5-OH-dCTP | dCTP | 5-OH-dCTP | 5-OH-dCTP
G | 7-deaza-7-nitro-dGTP | 7-deaza-7-nitro-dGTP | dGTP | 7-deaza-7-nitro-dGTP
T | 5-OH-dUTP | 5-OH-dUTP | 5-OH-dUTP | dTTP

TABLE 19 Fragment Identity Mappings for Multicutters in Family _(16/15)[inv(α.α)]

Fragment Length (nt) | Multicutter _(16/15)[inv(A.A)] (dATP): Mass (Da) | Fragment | Multicutter _(16/15)[inv(C.C)] (dCTP): Mass (Da) | Fragment
2 | 724 | 5′p-AA-3′p | 676 | 5′p-CC-3′p
3 | 1038 | 5′p-AAA-3′p | 966 | 5′p-CCC-3′p
4 | 1351 | 5′p-AAAA-3′p | 1255 | 5′p-CCCC-3′p
5 | 1664 | 5′p-AAAAA-3′p | 1544 | 5′p-CCCCC-3′p
6 | 1977 | 5′p-AAAAAA-3′p | 1833 | 5′p-CCCCCC-3′p
7 | 2290 | 5′p-AAAAAAA-3′p | 2122 | 5′p-CCCCCCC-3′p
8 | 2604 | 5′p-AAAAAAAA-3′p | 2412 | 5′p-CCCCCCCC-3′p
9 | 2917 | 5′p-AAAAAAAAA-3′p | 2701 | 5′p-CCCCCCCCC-3′p
10 | 3230 | 5′p-AAAAAAAAAA-3′p | 2990 | 5′p-CCCCCCCCCC-3′p
11 | 3543 | 5′p-AAAAAAAAAAA-3′p | 3279 | 5′p-CCCCCCCCCCC-3′p
12 | 3856 | 5′p-AAAAAAAAAAAA-3′p | 3568 | 5′p-CCCCCCCCCCCC-3′p
13 | 4170 | 5′p-AAAAAAAAAAAAA-3′p | 3858 | 5′p-CCCCCCCCCCCCC-3′p
14 | 4483 | 5′p-AAAAAAAAAAAAAA-3′p | 4147 | 5′p-CCCCCCCCCCCCCC-3′p
15 | 4796 | 5′p-AAAAAAAAAAAAAAA-3′p | 4436 | 5′p-CCCCCCCCCCCCCCC-3′p
16 | 5109 | 5′p-AAAAAAAAAAAAAAAA-3′p | 4725 | 5′p-CCCCCCCCCCCCCCCC-3′p
L | L(313.2) + 98 | 5′p-(A)_(L)-3′p | L(289.2) + 98 | 5′p-(C)_(L)-3′p

Fragment Length (nt) | Multicutter _(16/15)[inv(G.G)] (dGTP): Mass (Da) | Fragment | Multicutter _(16/15)[inv(T.T)] (dTTP): Mass (Da) | Fragment
2 | 756 | 5′p-GG-3′p | 706 | 5′p-TT-3′p
3 | 1086 | 5′p-GGG-3′p | 1011 | 5′p-TTT-3′p
4 | 1415 | 5′p-GGGG-3′p | 1315 | 5′p-TTTT-3′p
5 | 1744 | 5′p-GGGGG-3′p | 1619 | 5′p-TTTTT-3′p
6 | 2073 | 5′p-GGGGGG-3′p | 1923 | 5′p-TTTTTT-3′p
7 | 2402 | 5′p-GGGGGGG-3′p | 2227 | 5′p-TTTTTTT-3′p
8 | 2732 | 5′p-GGGGGGGG-3′p | 2532 | 5′p-TTTTTTTT-3′p
9 | 3061 | 5′p-GGGGGGGGG-3′p | 2836 | 5′p-TTTTTTTTT-3′p
10 | 3390 | 5′p-GGGGGGGGGG-3′p | 3140 | 5′p-TTTTTTTTTT-3′p
11 | 3719 | 5′p-GGGGGGGGGGG-3′p | 3444 | 5′p-TTTTTTTTTTT-3′p
12 | 4048 | 5′p-GGGGGGGGGGGG-3′p | 3748 | 5′p-TTTTTTTTTTTT-3′p
13 | 4378 | 5′p-GGGGGGGGGGGGG-3′p | 4053 | 5′p-TTTTTTTTTTTTT-3′p
14 | 4707 | 5′p-GGGGGGGGGGGGGG-3′p | 4357 | 5′p-TTTTTTTTTTTTTT-3′p
15 | 5036 | 5′p-GGGGGGGGGGGGGGG-3′p | 4661 | 5′p-TTTTTTTTTTTTTTT-3′p
16 | 5365 | 5′p-GGGGGGGGGGGGGGGG-3′p | 4965 | 5′p-TTTTTTTTTTTTTTTT-3′p
L | L(329.2) + 98 | 5′p-(G)_(L)-3′p | L(304.2) + 98 | 5′p-(T)_(L)-3′p
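The general-length rows of Table 19 give each homopolymer fragment mass as L times the deoxynucleotide mass plus 98 Da. The following minimal sketch evaluates that formula using the deoxynucleotide masses of Table 17; the 98 Da term is taken as printed in the table, and the function and constant names are hypothetical.

```python
# Deoxynucleotide monophosphate masses in Da, as listed in Table 17.
DEOXY_MASS_DA = {"A": 313.2, "C": 289.2, "G": 329.2, "T": 304.2}

# Constant end-group term from the general-length rows of Table 19.
END_GROUP_DA = 98.0


def homopolymer_fragment_mass(base: str, length: int) -> float:
    """Mass of a 5'p-(base)_L-3'p fragment: L * m(base) + 98."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return length * DEOXY_MASS_DA[base] + END_GROUP_DA


if __name__ == "__main__":
    for base in "ACGT":
        masses = [round(homopolymer_fragment_mass(base, n)) for n in (2, 8, 16)]
        print(base, masses)
```

Rounded to the nearest dalton, the formula gives 724 Da for AA and 4725 Da for the 16-mer of C, in agreement with the corresponding table entries.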

TABLE 20 Multicutter Family _(4/3)[inv(α.)]
Nucleotide | _(4/3)[inv(A.)] | _(4/3)[inv(C.)] | _(4/3)[inv(G.)] | _(4/3)[inv(T.)]
A | dATP | rATP | rATP | rATP
C | rCTP | dCTP | rCTP | rCTP
G | rGTP | rGTP | dGTP | rGTP
T | rTTP | rTTP | rTTP | dTTP

TABLE 21 Fragment Identity Mappings for Multicutters in Family _(4/3)[α. β. γ.]

Fragment Length (nt) | Multicutter _(4/3)[B.] (dATP, rCTP, rGTP, rTTP): Mass (Da) | Fragment | Multicutter _(4/3)[D.] (rATP, dCTP, rGTP, rTTP): Mass (Da) | Fragment
2 | 556 | 5′OH-AC-(2′,3′)OH | 547 | 5′OH-CT-(2′,3′)OH
2 | 571 | 5′OH-AT-(2′,3′)OH | 556 | 5′OH-CA-(2′,3′)OH
2 | 596 | 5′OH-AG-(2′,3′)OH | 572 | 5′OH-CG-(2′,3′)OH
3 | 870 | 5′OH-AAC-(2′,3′)OH | 837 | 5′OH-CCT-(2′,3′)OH
3 | 885 | 5′OH-AAT-(2′,3′)OH | 846 | 5′OH-CCA-(2′,3′)OH
3 | 910 | 5′OH-AAG-(2′,3′)OH | 862 | 5′OH-CCG-(2′,3′)OH
4 | 1183 | 5′OH-AAAC-(2′,3′)OH | 1126 | 5′OH-CCCT-(2′,3′)OH
4 | 1198 | 5′OH-AAAT-(2′,3′)OH | 1135 | 5′OH-CCCA-(2′,3′)OH
4 | 1223 | 5′OH-AAAG-(2′,3′)OH | 1151 | 5′OH-CCCG-(2′,3′)OH
5 | 1496 | 5′OH-AAAAC-(2′,3′)OH | 1415 | 5′OH-CCCCT-(2′,3′)OH
5 | 1511 | 5′OH-AAAAT-(2′,3′)OH | 1424 | 5′OH-CCCCA-(2′,3′)OH
5 | 1536 | 5′OH-AAAAG-(2′,3′)OH | 1440 | 5′OH-CCCCG-(2′,3′)OH
6 | 1809 | 5′OH-AAAAAC-(2′,3′)OH | 1704 | 5′OH-CCCCCT-(2′,3′)OH
6 | 1824 | 5′OH-AAAAAT-(2′,3′)OH | 1713 | 5′OH-CCCCCA-(2′,3′)OH
6 | 1849 | 5′OH-AAAAAG-(2′,3′)OH | 1729 | 5′OH-CCCCCG-(2′,3′)OH
7 | 2122 | 5′OH-AAAAAAC-(2′,3′)OH | 1993 | 5′OH-CCCCCCT-(2′,3′)OH
7 | 2137 | 5′OH-AAAAAAT-(2′,3′)OH | 2002 | 5′OH-CCCCCCA-(2′,3′)OH
7 | 2162 | 5′OH-AAAAAAG-(2′,3′)OH | 2018 | 5′OH-CCCCCCG-(2′,3′)OH
8 | 2436 | 5′OH-AAAAAAAC-(2′,3′)OH | 2283 | 5′OH-CCCCCCCT-(2′,3′)OH
8 | 2451 | 5′OH-AAAAAAAT-(2′,3′)OH | 2292 | 5′OH-CCCCCCCA-(2′,3′)OH
8 | 2476 | 5′OH-AAAAAAAG-(2′,3′)OH | 2308 | 5′OH-CCCCCCCG-(2′,3′)OH
L | (L-1)(313.2) + 289.2 - 46 | 5′OH-(A)_((L-1))C-(2′,3′)OH | (L-1)(289.2) + 304.2 - 46 | 5′OH-(C)_((L-1))T-(2′,3′)OH
L | (L-1)(313.2) + 304.2 - 46 | 5′OH-(A)_((L-1))T-(2′,3′)OH | (L-1)(289.2) + 313.2 - 46 | 5′OH-(C)_((L-1))A-(2′,3′)OH
L | (L-1)(313.2) + 329.2 - 46 | 5′OH-(A)_((L-1))G-(2′,3′)OH | (L-1)(289.2) + 329.2 - 46 | 5′OH-(C)_((L-1))G-(2′,3′)OH

Fragment Length (nt) | Multicutter _(4/3)[H.] (rATP, rCTP, dGTP, rTTP): Mass (Da) | Fragment | Multicutter _(4/3)[V.] (rATP, rCTP, rGTP, dTTP): Mass (Da) | Fragment
2 | 572 | 5′OH-GC-(2′,3′)OH | 547 | 5′OH-TC-(2′,3′)OH
2 | 587 | 5′OH-GT-(2′,3′)OH | 571 | 5′OH-TA-(2′,3′)OH
2 | 596 | 5′OH-GA-(2′,3′)OH | 587 | 5′OH-TG-(2′,3′)OH
3 | 902 | 5′OH-GGC-(2′,3′)OH | 852 | 5′OH-TTC-(2′,3′)OH
3 | 917 | 5′OH-GGT-(2′,3′)OH | 876 | 5′OH-TTA-(2′,3′)OH
3 | 926 | 5′OH-GGA-(2′,3′)OH | 892 | 5′OH-TTG-(2′,3′)OH
4 | 1231 | 5′OH-GGGC-(2′,3′)OH | 1156 | 5′OH-TTTC-(2′,3′)OH
4 | 1246 | 5′OH-GGGT-(2′,3′)OH | 1180 | 5′OH-TTTA-(2′,3′)OH
4 | 1255 | 5′OH-GGGA-(2′,3′)OH | 1196 | 5′OH-TTTG-(2′,3′)OH
5 | 1560 | 5′OH-GGGGC-(2′,3′)OH | 1460 | 5′OH-TTTTC-(2′,3′)OH
5 | 1575 | 5′OH-GGGGT-(2′,3′)OH | 1484 | 5′OH-TTTTA-(2′,3′)OH
5 | 1584 | 5′OH-GGGGA-(2′,3′)OH | 1500 | 5′OH-TTTTG-(2′,3′)OH
6 | 1889 | 5′OH-GGGGGC-(2′,3′)OH | 1764 | 5′OH-TTTTTC-(2′,3′)OH
6 | 1904 | 5′OH-GGGGGT-(2′,3′)OH | 1788 | 5′OH-TTTTTA-(2′,3′)OH
6 | 1913 | 5′OH-GGGGGA-(2′,3′)OH | 1804 | 5′OH-TTTTTG-(2′,3′)OH
7 | 2218 | 5′OH-GGGGGGC-(2′,3′)OH | 2068 | 5′OH-TTTTTTC-(2′,3′)OH
7 | 2233 | 5′OH-GGGGGGT-(2′,3′)OH | 2092 | 5′OH-TTTTTTA-(2′,3′)OH
7 | 2242 | 5′OH-GGGGGGA-(2′,3′)OH | 2108 | 5′OH-TTTTTTG-(2′,3′)OH
8 | 2548 | 5′OH-GGGGGGGC-(2′,3′)OH | 2373 | 5′OH-TTTTTTTC-(2′,3′)OH
8 | 2563 | 5′OH-GGGGGGGT-(2′,3′)OH | 2397 | 5′OH-TTTTTTTA-(2′,3′)OH
8 | 2572 | 5′OH-GGGGGGGA-(2′,3′)OH | 2413 | 5′OH-TTTTTTTG-(2′,3′)OH
L | (L-1)(329.2) + 289.2 - 46 | 5′OH-(G)_((L-1))C-(2′,3′)OH | (L-1)(304.2) + 289.2 - 46 | 5′OH-(T)_((L-1))C-(2′,3′)OH
L | (L-1)(329.2) + 304.2 - 46 | 5′OH-(G)_((L-1))T-(2′,3′)OH | (L-1)(304.2) + 313.2 - 46 | 5′OH-(T)_((L-1))A-(2′,3′)OH
L | (L-1)(329.2) + 313.2 - 46 | 5′OH-(G)_((L-1))A-(2′,3′)OH | (L-1)(304.2) + 329.2 - 46 | 5′OH-(T)_((L-1))G-(2′,3′)OH
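A fragment identity mapping such as Table 21 is used by looking up an observed mass and reading off the fragment. The sketch below builds the mapping for the _(4/3)[B.] multicutter from the general-length formula printed in the table, (L-1)(313.2) plus the mass of the terminal base minus 46, and then matches an observed mass within a tolerance. The 1 Da tolerance, the function names and the example masses are assumptions for illustration, not values from the source.

```python
from typing import Dict, Optional

MASS_A_DA = 313.2                                            # repeated base (A)
TERMINAL_MASS_DA = {"C": 289.2, "T": 304.2, "G": 329.2}      # closing base
END_TERM_DA = -46.0                                          # as printed in Table 21


def build_mapping(max_length: int = 8) -> Dict[str, float]:
    """Map each fragment (A)_(L-1)X, X in C/T/G, to its calculated mass."""
    mapping: Dict[str, float] = {}
    for length in range(2, max_length + 1):
        for base, base_mass in TERMINAL_MASS_DA.items():
            fragment = "A" * (length - 1) + base
            mapping[fragment] = (length - 1) * MASS_A_DA + base_mass + END_TERM_DA
    return mapping


def identify(observed_mass: float, mapping: Dict[str, float],
             tolerance: float = 1.0) -> Optional[str]:
    """Return the single fragment within tolerance, or None if zero or several match."""
    hits = [f for f, m in mapping.items() if abs(m - observed_mass) <= tolerance]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    table = build_mapping()
    print(identify(870.0, table))    # AAC
    print(identify(1000.0, table))   # None
```

Returning None for ambiguous or unmatched masses keeps the lookup conservative: a peak is assigned to a sequence only when exactly one calculated fragment mass lies within the chosen tolerance.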

TABLE 22 Multicutter Family _(16/9)[inv(α.η η.β)]

Nucleotide | _(16/9)[B.V] | _(16/9)[B.H] | _(16/9)[B.D] | _(16/9)[D.V] | _(16/9)[D.H] | _(16/9)[D.B]
A | nATP | nATP | nATP | nrATP | nrATP | rATP
C | nrCTP | nrCTP | rCTP | nCTP | nCTP | nCTP
G | nrGTP | rGTP | nrGTP | nrGTP | rGTP | nrGTP
T | rTTP | nrTTP | nrTTP | rTTP | nrTTP | nrTTP

Nucleotide | _(16/9)[H.V] | _(16/9)[H.D] | _(16/9)[H.B] | _(16/9)[V.H] | _(16/9)[V.D] | _(16/9)[V.B]
A | nrATP | nrATP | rATP | nrATP | nrATP | rATP
C | nrCTP | rCTP | nrCTP | nrCTP | rCTP | nrCTP
G | nGTP | nGTP | nGTP | rGTP | nrGTP | nrGTP
T | rTTP | nrTTP | nrTTP | nTTP | nTTP | nTTP

TABLE 23 Fragment Identity Mapping for Multicutter _(16/9)[B.V] (nATP,nrCTP, nrGTP, rTTP) Fragment Length (nt) Mass (Da) Fragment 2  6155′NH₂-AC-2′,3′_(cyc)p 2  623 5′NH₂-CT-2′,3′_(cyc)p 2  6315′NH₂-AT-2′,3′_(cyc)p 2  655 5′NH₂-AG-2′,3′_(cyc)p 2  6635′NH₂-GT-2′,3′_(cyc)p 3  928 5′NH₂-AAC-2′,3′_(cyc)p 3  9365′NH₂-ACT-2′,3′_(cyc)p 3  944 5′NH₂-AAT-2′,3′_(cyc)p 35′NH₂-CTT-2′,3′_(cyc)p 3  952 5′NH₂-ATT˜2′,3′_(cyc)p 3  9685′NH₂-AAG-2′,3′_(cyc)p 3  976 5′NH₂-AGT-2′,3′_(cyc)p 3  9845′NH₂-GTT-2′,3′_(cyc)p 4 1240 5′NH₂-AAAC-2′,3′_(cyc)p 4 12485′NH₂-AACT-2′,3′_(cyc)p 4 1256 5′NH₂-AAAT-2′,3′_(cyc)p 45′NH₂-ACTT-2′,3′_(cyc)p 4 1264 5′NH₂-AATT-2′,3′_(cyc)p 45′NH₂-CTTT-2′,3′_(cyc)p 4 1272 5′NH₂-ATTT-2′,3′_(cyc)p 4 12805′NH₂-AAAG-2′,3′_(cyc)p 4 1288 5′NH₂-AAGT-2′,3′_(cyc)p 4 12965′NH₂-AGTT-2′,3′_(cyc)p 4 1304 5′NH₂-GTTT-2′,3′_(cyc)p 5 15525′NH₂-AAAAC-2′,3′_(cyc)p 5 1560 5′NH₂-AAACT-2′,3′_(cyc)p 5 15685′NH₂-AAAAT-2′,3′_(cyc)p 5 5′NH₂-AACTT-2′,3′_(cyc)p 5 15765′NH₂-AAATT-2′,3′_(cyc)p 5 5′NH₂-ACTTT-2′,3′_(cyc)p 5 15845′NH₂-AATTT-2′,3′_(cyc)p 5 5′NH₂-CTTTT-2′,3′_(cyc)p 5 15925′NH₂-AAAAG-2′,3′_(cyc)p 5 5′NH₂-ATTTT-2′,3′_(cyc)p 5 16005′NH₂-AAAGT-2′,3′_(cyc)p 5 1606 5′NH₂-AAGTT-2′,3′_(cyc)p 5 16165′NH₂-AGTTT-2′,3′_(cyc)p 5 1624 5′NH₂-GTTTT-2′,3′_(cyc)p 6 18645′NH₂-AAAAAC-2′,3′_(cyc)p 6 1872 5′NH₂-AAAACT-2′,3′_(cyc)p 6 18805′NH₂-AAAAAT-2′,3′_(cyc)p 6 5′NH₂-AAACTT-2′,3′_(cyc)p 6 18885′NH₂-AAAATT-2′,3′_(cyc)p 6 5′NH₂-AACTTT-2′,3′_(cyc)p 6 18965′NH₂-AAATTT-2′,3′_(cyc)p 6 5′NH₂-ACTTTT-2′,3′_(cyc)p 6 19045′NH₂-AAAAAG-2′,3′_(cyc)p 6 5′NH₂-AATTTT-2′,3′_(cyc)p 65′NH₂-CTTTTT-2′,3′_(cyc)p 6 1912 5′NH₂-AAAAGT-2′,3′_(cyc)p 65′NH₂-ATTTTT-2′,3′_(cyc)p 6 1920 5′NH₂-AAAGTT-2′,3′_(cyc)p 6 19285′NH₂-AAGTTT-2′,3′_(cyc)p 6 1936 5′NH₂-AGTTTT-2′,3′_(cyc)p 6 19445′NH₂-GTTTTT-2′,3′_(cyc)p 7 2176 5′NH₂-AAAAAAC-2′,3′_(cyc)p 7 21845′NH₂-AAAAACT-2′,3′_(cyc)p 7 2192 5′NH₂-AAAAAAT-2′,3′_(cyc)p 75′NH₂-AAAACTT-2′,3′_(cyc)p 7 2200 5′NH₂-AAAAATT-2′,3′_(cyc)p 75′NH₂-AAACTTT-2′,3′_(cyc)p 7 2208 5′NH₂-AAAATTT-2′,3′_(cyc)p 
75′NH₂-AACTTTT-2′,3′_(cyc)p 7 2216 5′NH₂-AAAAAAG-2′,3′_(cyc)p 75′NH₂-AAATTTT-2′,3′_(cyc)p 7 5′NH₂-ACTTTTT-2′,3′_(cyc)p 7 22245′NH₂-AAAAAGT-2′,3′_(cyc)p 7 5′NH₂-AATTTTT-2′,3′_(cyc)p 75′NH₂-CTTTTTT-2′,3′_(cyc)p 7 2232 5′NH₂-AAAAGTT-2′,3′_(cyc)p 75′NH₂-ATTTTTT-2′,3′_(cyc)p 7 2240 5′NH₂-AAAGTTT-2′,3′_(cyc)p 7 22485′NH₂-AAGTTTT-2′,3′_(cyc)p 7 2256 5′NH₂-AGTTTTT-2′,3′_(cyc)p 7 22645′NH₂-GTTTTTT-2′,3′_(cyc)p 8 2489 5′NH₂-AAAAAAAC-2′,3′_(cyc)p 8 24975′NH₂-AAAAAACT-2′,3′_(cyc)p 8 2505 5′NH₂-AAAAAAAT-2′,3′_(cyc)p 85′NH₂-AAAAACTT-2′,3′_(cyc)p 8 2513 5′NH₂-AAAAAATT-2′,3′_(cyc)p 85′NH₂-AAAACTTT-2′,3′_(cyc)p 8 2521 5′NH₂-AAAAATTT-2′,3′_(cyc)p 85′NH₂-AAACTTTT-2′,3′_(cyc)p 8 2529 5′NH₂-AAAAAAAG-2′,3′_(cyc)p 85′NH₂-AAAATTTT-2′,3′_(cyc)p 8 5′NH₂-AACTTTTT-2′,3′_(cyc)p 8 25375′NH₂-AAAAAAGT-2′,3′_(cyc)p 8 5′NH₂-AAATTTTT-2′,3′_(cyc)p 85′NH₂-ACTTTTTT-2′,3′_(cyc)p 8 2545 5′NH₂-AAAAAGTT-2′,3′_(cyc)p 85′NH₂-AATTTTTT-2′,3′_(cyc)p 8 5′NH₂-CTTTTTTT-2′,3′_(cyc)p

TABLE 24 Fragment Identity Mapping for Multicutter _(16/9)[B.H] (nATP,nrCTP, rGTP, nrTTP) Fragment Length (nt) Mass (Da) Fragment  2  6155′NH₂-AC-2′,3′_(cyc)p  2  630 5′NH₂-AT-2′,3′_(cyc)p  2  6485′NH₂-CG-2′,3′_(cyc)p  2  656 5′NH₂-AG-2′,3′_(cyc)p  2  6635′NH₂-TG-2′,3′_(cyc)p  3  928 5′NH₂-AAC-2′,3′_(cyc)p  3  9435′NH₂-AAT-2′,3′_(cyc)p  3  961 5′NH₂-ACG-2′,3′_(cyc)p  3  9695′NH₂-AAG-2′,3′_(cyc)p  3  976 5′NH₂-ATG-2′,3′_(cyc)p  3  9945′NH₂-CGG-2′,3′_(cyc)p  3 1002 5′NH₂-AGG-2′,3′_(cyc)p  3 10095′NH₂-TGG-2′,3′_(cyc)p  4 1240 5′NH₂-AAAC-2′,3′_(cyc)p  4 12555′NH₂-AAAT-2′,3′_(cyc)p  4 1273 5′NH₂-AACG-2′,3′_(cyc)p  4 12815′NH₂-AAAG-2′,3′_(cyc)p  4 1288 5′NH₂-AATG-2′,3′_(cyc)p  4 13065′NH₂-ACGG-2′,3′_(cyc)p  4 1314 5′NH₂-AAGG-2′,3′_(cyc)p  4 13215′NH₂-ATGG-2′,3′_(cyc)p  4 1339 5′NH₂-CGGG-2′,3′_(cyc)p  4 13475′NH₂-AGGG-2′,3′_(cyc)p  4 1354 5′NH₂-TGGG-2′,3′_(cyc)p  5 15525′NH₂-AAAAC-2′,3′_(cyc)p  5 1567 5′NH₂-AAAAT-2′,3′_(cyc)p  5 15855′NH₂-AAACG-2′,3′_(cyc)p  5 1593 5′NH₂-AAAAG-2′,3′_(cyc)p  5 16005′NH₂-AAATG-2′,3′_(cyc)p  5 1618 5′NH₂-AACGG-2′,3′_(cyc)p  5 16265′NH₂-AAAGG-2′,3′_(cyc)p  5 1633 5′NH₂-AATGG-2′,3′_(cyc)p  5 16515′NH₂-ACGGG-2′,3′_(cyc)p  5 1659 5′NH₂-AAGGG-2′,3′_(cyc)p  5 16665′NH₂-ATGGG-2′,3′_(cyc)p  5 1684 5′NH₂-CGGGG-2′,3′_(cyc)p  5 16925′NH₂-AGGGG-2′,3′_(cyc)p  5 1699 5′NH₂-TGGGG-2′,3′_(cyc)p  6 18645′NH₂-AAAAAC-2′,3′_(cyc)p  6 1879 5′NH₂-AAAAAT-2′,3′_(cyc)p  6 18975′NH₂-AAAACG-2′,3′_(cyc)p  6 1905 5′NH₂-AAAAAG-2′,3′_(cyc)p  6 19125′NH₂-AAAATG-2′,3′_(cyc)p  6 1930 5′NH₂-AAACGG-2′,3′_(cyc)p  6 19385′NH₂-AAAAGG-2′,3′_(cyc)p  6 1945 5′NH₂-AAATGG-2′,3′_(cyc)p  6 19635′NH₂-AACGGG-2′,3′_(cyc)p  6 1971 5′NH₂-AAAGGG-2′,3′_(cyc)p  6 19785′NH₂-AATGGG-2′,3′_(cyc)p  6 1996 5′NH₂-ACGGGG-2′,3′_(cyc)p  6 20045′NH₂-AAGGGG-2′,3′_(cyc)p  6 2011 5′NH₂-ATGGGG-2′,3′_(cyc)p  6 20295′NH₂-CGGGGG-2′,3′_(cyc)p  6 2037 5′NH₂-AGGGGG-2′,3′_(cyc)p  6 20445′NH₂-TGGGGG-2′,3′_(cyc)p  7 2176 5′NH₂-AAAAAAC-2′,3′_(cyc)p  7 21915′NH₂-AAAAAAT-2′,3′_(cyc)p  7 2209 5′NH₂-AAAAACG-2′,3′_(cyc)p  7 
22175′NH₂-AAAAAAG-2′,3′_(cyc)p  7 2224 5′NH₂-AAAAATG-2′,3′_(cyc)p  7 22425′NH₂-AAAACGG-2′,3′_(cyc)p  7 2250 5′NH₂-AAAAAGG-2′,3′_(cyc)p  7 22575′NH₂-AAAATGG-2′,3′_(cyc)p  7 2275 5′NH₂-AAACGGG-2′,3′_(cyc)p  7 22835′NH₂-AAAAGGG-2′,3′_(cyc)p  7 2290 5′NH₂-AAATGGG-2′,3′_(cyc)p  7 23085′NH₂-AACGGGG-2′,3′_(cyc)p  7 2316 5′NH₂-AAAGGGG-2′,3′_(cyc)p  7 23235′NH₂-AATGGGG-2′,3′_(cyc)p  7 2341 5′NH₂-ACGGGGG-2′,3′_(cyc)p  7 23495′NH₂-AAGGGGG-2′,3′_(cyc)p  7 2356 5′NH₂-ATGGGGG-2′,3′_(cyc)p  7 23745′NH₂-CGGGGGG-2′,3′_(cyc)p  7 2382 5′NH₂-AGGGGGG-2′,3′_(cyc)p  7 23895′NH₂-TGGGGGG-2′,3′_(cyc)p  8 2489 5′NH₂-AAAAAAAC-2′,3′_(cyc)p  8 25045′NH₂-AAAAAAAT-2′,3′_(cyc)p  8 2522 5′NH₂-AAAAAACG-2′,3′_(cyc)p  8 25305′NH₂-AAAAAAAG-2′,3′_(cyc)p  8 2537 5′NH₂-AAAAAATG-2′,3′_(cyc)p  8 25555′NH₂-AAAAACGG-2′,3′_(cyc)p  8 2563 5′NH₂-AAAAAAGG-2′,3′_(cyc)p  8 25705′NH₂-AAAAATGG-2′,3′_(cyc)p  8 2588 5′NH₂-AAAACGGG-2′,3′_(cyc)p  8 25965′NH₂-AAAAAGGG-2′,3′_(cyc)p  8 2603 5′NH₂-AAAATGGG-2′,3′_(cyc)p  8 26215′NH₂-AAACGGGG-2′,3′_(cyc)p  8 2629 5′NH₂-AAAAGGGG-2′,3′_(cyc)p  8 26365′NH₂-AAATGGGG-2′,3′_(cyc)p  8 2654 5′NH₂-AACGGGGG-2′,3′_(cyc)p  8 26625′NH₂-AAAGGGGG-2′,3′_(cyc)p  8 2669 5′NH₂-AATGGGGG-2′,3′_(cyc)p  8 26675′NH₂-ACGGGGGG-2′,3′_(cyc)p  8 2695 5′NH₂-AAGGGGGG-2′,3′_(cyc)p  8 27025′NH₂-ATGGGGGG-2′,3′_(cyc)p  8 2720 5′NH₂-CGGGGGGG-2′,3′_(cyc)p  8 27285′NH₂-AGGGGGGG-2′,3′_(cyc)p  8 2735 5′NH₂-TGGGGGGG-2′,3′_(cyc)p  9 28015′NH₂-AAAAAAAAC-2′,3′_(cyc)p  9 2816 5′NH₂-AAAAAAAAT-2′,3′_(cyc)p  92834 5′NH₂-AAAAAAACG-2′,3′_(cyc)p  9 2842 5′NH₂-AAAAAAAAG-2′,3′_(cyc)p 9 2849 5′NH₂-AAAAAAATG-2′,3′_(cyc)p  9 28675′NH₂-AAAAAACGG-2′,3′_(cyc)p  9 2875 5′NH₂-AAAAAAAGG-2′,3′_(cyc)p  92882 5′NH₂-AAAAAATGG-2′,3′_(cyc)p  9 2900 5′NH₂-AAAAACGGG-2′,3′_(cyc)p 9 2908 5′NH₂-AAAAAAGGG-2′,3′_(cyc)p  9 29155′NH₂-AAAAATGGG-2′,3′_(cyc)p  9 2933 5′NH₂-AAAACGGGG-2′,3′_(cyc)p  92941 5′NH₂-AAAAAGGGG-2′,3′_(cyc)p  9 2948 5′NH₂-AAAATGGGG-2′,3′_(cyc)p 9 2966 5′NH₂-AAACGGGGG-2′,3′_(cyc)p  9 29745′NH₂-AAAAGGGGG-2′,3′_(cyc)p  9 2981 
5′NH₂-AAATGGGGG-2′,3′_(cyc)p  92999 5′NH₂-AACGGGGGG-2′,3′_(cyc)p  9 3007 5′NH₂-AAAGGGGGG-2′,3′_(cyc)p 9 3014 5′NH₂-AATGGGGGG-2′,3′_(cyc)p  9 30325′NH₂-ACGGGGGGG-2′,3′_(cyc)p  9 3040 5′NH₂-AAGGGGGGG-2′,3′_(cyc)p  93047 5′NH₂-ATGGGGGGG-2′,3′_(cyc)p  9 3065 5′NH₂-CGGGGGGGG-2′,3′_(cyc)p 9 3073 5′NH₂-AGGGGGGGG-2′,3′_(cyc)p  9 30805′NH₂-TGGGGGGGG-2′,3′_(cyc)p 10 3113 5′NH₂-AAAAAAAAAC-2′,3′_(cyc)p 103128 5′NH₂-AAAAAAAAAT-2′,3′_(cyc)p 10 3146 5′NH₂-AAAAAAAACG-2′,3′_(cyc)p10 3154 5′NH₂-AAAAAAAAAG-2′,3′_(cyc)p 10 31615′NH₂-AAAAAAAATG-2′,3′_(cyc)p 10 3179 5′NH₂-AAAAAAACGG-2′,3′_(cyc)p 103187 5′NH₂-AAAAAAAAGG-2′,3′_(cyc)p 10 3194 5′NH₂-AAAAAAATGG-2′,3′_(cyc)p10 3212 5′NH₂-AAAAAACGGG-2′,3′_(cyc)p 10 32205′NH₂-AAAAAAAGGG-2′,3′_(cyc)p 10 3227 5′NH₂-AAAAAATGGG-2′,3′_(cyc)p 103245 5′NH₂-AAAAACGGGG-2′,3′_(cyc)p 10 3253 5′NH₂-AAAAAAGGGG-2′,3′_(cyc)p10 3260 5′NH₂-AAAAATGGGG-2′,3′_(cyc)p 10 32785′NH₂-AAAACGGGGG-2′,3′_(cyc)p 10 3286 5′NH₂-AAAAAGGGGG-2′,3′_(cyc)p 103293 5′NH₂-AAAATGGGGG-2′,3′_(cyc)p 10 3311 5′NH₂-AAACGGGGGG-2′,3′_(cyc)p10 3319 5′NH₂-AAAAGGGGGG-2′,3′_(cyc)p 10 33265′NH₂-AAATGGGGGG-2′,3′_(cyc)p 10 3344 5′NH₂-AACGGGGGGG-2′,3′_(cyc)p 103352 5′NH₂-AAAGGGGGGG-2′,3′_(cyc)p 10 3359 5′NH₂-AATGGGGGGG-2′,3′_(cyc)p10 3377 5′NH₂-ACGGGGGGGG-2′,3′_(cyc)p 10 33855′NH₂-AAGGGGGGGG-2′,3′_(cyc)p 10 3392 5′NH₂-ATGGGGGGGG-2′,3′_(cyc)p 103410 5′NH₂-CGGGGGGGGG-2′,3′_(cyc)p 10 3418 5′NH₂-AGGGGGGGGG-2′,3′_(cyc)p10 3425 5′NH₂-TGGGGGGGGG-2′,3′_(cyc)p 11 5′NH₂-AAAAAAAAAAC-2′,3′_(cyc)p11 3440 5′NH₂-AAAAAAAAAAT-2′,3′_(cyc)p 11 34585′NH₂-AAAAAAAAACG-2′,3′_(cyc)p 11 3466 5′NH₂-AAAAAAAAAAG-2′,3′_(cyc)p 113473 5′NH₂-AAAAAAAAATG-2′,3′_(cyc)p 11 34915′NH₂-AAAAAAAACGG-2′,3′_(cyc)p 11 3499 5′NH₂-AAAAAAAAAGG-2′,3′_(cyc)p 113506 5′NH₂-AAAAAAAATGG-2′,3′_(cyc)p 11 35245′NH₂-AAAAAAACGGG-2′,3′_(cyc)p 11 3532 5′NH₂-AAAAAAAAGGG-2′,3′_(cyc)p 113539 5′NH₂-AAAAAAATGGG-2′,3′_(cyc)p 11 35575′NH₂-AAAAAACGGGG-2′,3′_(cyc)p 11 3565 5′NH₂-AAAAAAAGGGG-2′,3′_(cyc)p 113572 5′NH₂-AAAAAATGGGG-2′,3′_(cyc)p 11 
35905′NH₂-AAAAACGGGGG-2′,3′_(cyc)p 11 3598 5′NH₂-AAAAAAGGGGG-2′,3′_(cyc)p 113605 5′NH₂-AAAAATGGGGG-2′,3′_(cyc)p 11 36235′NH₂-AAAACGGGGGG-2′,3′_(cyc)p 11 3631 5′NH₂-AAAAAGGGGGG-2′,3′_(cyc)p 113638 5′NH₂-AAAATGGGGGG-2′,3′_(cyc)p 11 36565′NH₂-AAACGGGGGGG-2′,3′_(cyc)p 11 3664 5′NH₂-AAAAGGGGGGG-2′,3′_(cyc)p 113671 5′NH₂-AAATGGGGGGG-2′,3′_(cyc)p 11 36895′NH₂-AACGGGGGGGG-2′,3′_(cyc)p 11 3697 5′NH₂-AAAGGGGGGGG-2′,3′_(cyc)p 113704 5′NH₂-AATGGGGGGGG-2′,3′_(cyc)p 11 37225′NH₂-ACGGGGGGGGG-2′,3′_(cyc)p 11 3730 5′NH₂-AAGGGGGGGGG-2′,3′_(cyc)p 113737 5′NH₂-ATGGGGGGGGG-2′,3′_(cyc)p 12 5′NH₂-AAAAAAAAAAAC-2′,3′_(cyc)p12 3752 5′NH₂-AAAAAAAAAAAT-2′,3′_(cyc)p 11 37555′NH₂-CGGGGGGGGGG-2′,3′_(cyc)p 11 3763 5′NH₂-AGGGGGGGGGG-2′,3′_(cyc)p 113770 5′NH₂-TGGGGGGGGGG-2′,3′_(cyc)p 12 5′NH₂-AAAAAAAAAACG-2′,3′_(cyc)p12 3778 5′NH₂-AAAAAAAAAAAG-2′,3′_(cyc)p 12 37855′NH₂-AAAAAAAAAATG-2′,3′_(cyc)p 12 3803 5′NH₂-AAAAAAAAACGG-2′,3′_(cyc)p12 3811 5′NH₂-AAAAAAAAAAGG-2′,3′_(cyc)p 12 38185′NH₂-AAAAAAAAATGG-2′,3′_(cyc)p 12 3836 5′NH₂-AAAAAAAACGGG-2′,3′_(cyc)p12 3844 5′NH₂-AAAAAAAAAGGG-2′,3′_(cyc)p 12 38515′NH₂-AAAAAAAATGGG-2′,3′_(cyc)p 12 3869 5′NH₂-AAAAAAACGGGG-2′,3′_(cyc)p12 3877 5′NH₂-AAAAAAAAGGGG-2′,3′_(cyc)p 12 38845′NH₂-AAAAAAATGGGG-2′,3′_(cyc)p 12 3902 5′NH₂-AAAAAACGGGGG-2′,3′_(cyc)p12 3910 5′NH₂-AAAAAAAGGGGG-2′,3′_(cyc)p 12 39175′NH₂-AAAAAATGGGGG-2′,3′_(cyc)p 12 3935 5′NH₂-AAAAACGGGGGG-2′,3′_(cyc)p12 3943 5′NH₂-AAAAAAGGGGGG-2′,3′_(cyc)p 12 39505′NH₂-AAAAATGGGGGG-2′,3′_(cyc)p 12 3968 5′NH₂-AAAACGGGGGGG-2′,3′_(cyc)p12 3976 5′NH₂-AAAAAGGGGGGG-2′,3′_(cyc)p 12 39835′NH₂-AAAATGGGGGGG-2′,3′_(cyc)p 12 4001 5′NH₂-AAACGGGGGGGG-2′,3′_(cyc)p12 4009 5′NH₂-AAAAGGGGGGGG-2′,3′_(cyc)p

TABLE 25 EMBL Accession No. AJ536038 AJ536037 AJ536040 AJ536039 AJ536042AJ536036 Species Mycobacterium Mycobacterium Mycobacterium MycobacteriumMycobacterium Mycobacterium abscessus avium subsp. celarum fortuitumgordonae intracel- avium subsp. lulare fortuitum Sequence Length (nt)1455 1472 1426 1457 1072 1440 Multicutter _(4/3)[inv(A.)] aaac aaac aaacaaac aaac aaac dATP, rCTP, rGTP, rTTP aaag aaag aaag aaag aaag aaag —AAAT AAAT — AAAT AAAT aaaac aaaac aaaac aaaac aaaac aaaac — — — — — —Multicutter _(4/3)[inv(C.)] ccca ccca ccca ccca ccca ccca rATP, dCTP,rGTP, rTTP cccg cccg cccg cccg cccg cccg ccct ccct ccct ccct ccct ccctCCCCA CCCCA — — — CCCCA ccccg ccccg ccccg ccccg ccccg ccccg CCCCT CCCCTCCCCT CCCCT — CCCCT Multicutter _(4/3)[inv(G,)] ggga ggga ggga ggga gggaggga rATP, rCTP, dGTP, rTTP gggc gggc gggc gggc gggc gggc gggt gggt gggtgggt gggt gggt gggga gggga gggga gggga gggga gggga — — — — — — gggtggggt ggggt ggggt ggggt ggggt — GGGGGA GGGGGA — GGGGGA GGGGGA gggggcgggggc gggggc gggggc gggggc gggggc — — GGGGGT — — GGGGGT Multicutter_(4/3)[inv(T.)] — — TTTA — — TTTA rATP, rCTP, rGTP, dTTP tttc tttc tttctttc tttc tttc tttg tttg tttg tttg tttg tttg — TTTTA — — — TTTTA — TTTTCTTTTC — TTTTC TTTTC TTTTG TTTTG TTTTG TTTTG TTTTG TTTTG — — TTTTTG — — —— — — — — — EMBL Accession No. 
AJ536035 AJ536032 AJ536034 AJ536041AJ536031 AJ536033 Species Mycobacterium Mycobacterium MycobacteriumMycobacterium Mycobacterium Mycobacterium kansasii marinum scrofulaceumsmegmatis tuberculosis xenopi Sequence Length (nt) 1470 1410 1467 14611471 1480 Multicutter _(4/3)[inv(A.)] aaac aaac aaac aaac aaac aaacdATP, rCTP, rGTP, rTTP aaag aaag aaag aaag aaag aaag AAAT AAAT AAAT —AAAT — aaaac aaaac aaaac aaaac aaaac aaaac — — — — AAAAG — Multicutter_(4/3)[inv(C.)] ccca ccca ccca ccca ccca ccca rATP, dCTP, rGTP, rTTPcccg cccg cccg cccg cccg cccg ccct ccct ccct ccct ccct ccct — — — — — —ccccg ccccg ccccg ccccg ccccg ccccg CCCCT CCCCT CCCCT CCCCT CCCCT CCCCTMulticutter _(4/3)[inv(G,)] ggga ggga ggga ggga ggga ggga rATP, rCTP,dGTP, rTTP gggc gggc gggc gggc gggc gggc gggt gggt gggt gggt gggt gggtgggga gggga gggga gggga gggga gggga — — — — — GGGGC ggggt ggggt ggggtggggt ggggt ggggt GGGGGA GGGGGA GGGGGA — GGGGGA GGGGGA gggggc gggggcgggggc gggggc gggggc gggggc — — GGGGGT GGGGGT — — Multicutter_(4/3)[inv(T.)] — TTTA — — TTTA — rATP, rCTP, rGTP, dTTP tttc tttc tttctttc tttc tttc tttg tttg tttg tttg tttg tttg TTTTA — TTTTA — — — — TTTTC— — — TTTTC TTTTG TTTTG TTTTG TTTTG — — — — — — — — — — — — — TTTTTTG
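The comparison in Table 25 rests on listing, for each sequence, which long fragments a given multicutter can release and then noting which are present or absent. A minimal sketch of that idea for _(4/3)[inv(A.)] follows, assuming cleavage after every non-A base so that each fragment is a run of A's closed by one C, G or T. The two toy sequences are invented for illustration and are not the EMBL entries; the function name and the minimum length of 4 are likewise assumptions.

```python
import re
from typing import Set


def signature_fragments(sequence: str, kept_base: str = "A",
                        min_length: int = 4) -> Set[str]:
    """Fragments of at least min_length released by a cutter that cleaves
    after every base other than kept_base.

    A trailing run of kept_base with no closing base is ignored.
    """
    others = "".join(b for b in "ACGT" if b != kept_base)
    pattern = f"{kept_base}*[{others}]"   # e.g. "A*[CGT]"
    return {f for f in re.findall(pattern, sequence) if len(f) >= min_length}


if __name__ == "__main__":
    # Hypothetical toy targets, not real 16S sequences.
    first = signature_fragments("GAAACTTAAAGCAAAATG")
    second = signature_fragments("GAAACTTAAAGCCAAAAC")
    print("shared:     ", sorted(first & second))
    print("first only: ", sorted(first - second))
    print("second only:", sorted(second - first))
```

Fragments common to both toy sequences carry no discriminating information, while those found in only one of them play the role of the entries that differ between species in the table.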

1. A method of determining a target sequence of a template nucleic acid comprising the steps of:
a) creating a transcript of an isolated template nucleic acid using polymerase enzyme, nucleosides selected for sequence specific reactivity and molecular weight and oligonucleotide primers;
b) performing a cleavage reaction resulting in complete cleavage of the transcript in a sequence-specific manner into fragments using cutters selected from the group consisting of enzymatic cutters, chemical cutters, and a combination thereof;
c) analyzing the cleavage reaction products to determine the molecular weights of the fragments;
d) performing fragment identity mapping using nucleotide masses and cleavage specificities of the cutters to calculate the molecular weights and sequences of all possible fragments that result from step b) cleavage reactions; and
e) comparing the molecular weights of the fragments observed in step c) with the fragment identity mapping of step d), wherein the comparison results in determination of all the target sequences present in the sample.

2. The method of claim 1, wherein the oligonucleotide primers are sequence specific.

3. The method of claim 1, wherein the oligonucleotide primers have a random sequence.

4. The method of claim 1, wherein the molecular weight is determined using mass spectroscopy.

5. The method of claim 4, wherein mass spectroscopy is matrix-assisted laser desorption/ionization time-of-flight spectroscopy.

6. A method of determining the number of genes in a nucleic acid sample, comprising the steps of identifying any poly-A tails in the nucleic acid sample by the method of claim 4, wherein the digestion is performed using single-nucleotide cutters destroying all other nucleotides in the sample except for the poly-A containing fragments, and further analyzing the number of the poly-A containing fragments by analyzing the size of the peak from mass spectroscopy, wherein the peak size is indicative of the number of fragments comprising a poly-A tail.

7. A method of identifying the amount of a known nucleic acid sequence in a biological sample comprising the steps of selecting a unique sequence in the known nucleic acid sequence, selecting a nucleic acid cutter capable of digesting the nucleic acid sample containing a known nucleic acid sequence, transcribing the nucleic acid sample using random primers, digesting the transcript with the sequence-specific cutter to obtain fragments, analyzing the molecular weight of the fragments using mass spectroscopy, and determining the number of fragments in the sample by comparing the peak size from the digested sample to a peak size of a sample of the known sequence, wherein the comparison results in identification of the amount of the known nucleic acid sequence in the biological sample.